iglasser/DataFrame

DataFrame - A Simple C++ DataFrame Library

A lightweight, user-friendly C++ DataFrame library designed for simulation modelers who need pandas/polars-like functionality with a simple API. Built with only the C++ standard library, it provides intuitive data manipulation capabilities for tables with heterogeneous column types.

Overview

This library provides a DataFrame abstraction similar to pandas or polars, specifically designed for C++ simulation modelers who:

  • Are comfortable with Python dataframes but need C++ performance
  • Need to manipulate tabular data in simulation models
  • Want a simple, intuitive API without steep learning curves
  • Require support for custom types alongside standard types
  • Need efficient filtering, aggregation, and data access operations

Key Features

Simple API - Intuitive, pandas-like interface familiar to Python users
Type-Safe - Compile-time and runtime type checking with clear error messages
Zero Dependencies - Uses only C++ standard library (C++17+)
Custom Types - Seamless support for user-defined structs and classes
Efficient - Column-major storage for cache-friendly operations
Flexible - Heterogeneous columns with different types
Practical - Designed for real-world simulation modeling use cases
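
The "column-major storage" and "heterogeneous columns" points above can be pictured with a small standalone sketch. This is not the library's implementation: MiniFrame, its add_column and column members, and the use of std::any for type erasure are all assumptions made for illustration.

```cpp
#include <any>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in, NOT the library's DataFrame class.
class MiniFrame {
public:
    // Each column lives in one contiguous std::vector<T> (column-major),
    // type-erased behind std::any so columns of different types can coexist.
    template <typename T>
    void add_column(const std::string& name, std::vector<T> values) {
        if (!columns_.empty() && values.size() != rows_) {
            throw std::invalid_argument("column '" + name + "' has the wrong length");
        }
        rows_ = values.size();
        columns_[name] = std::move(values);
    }

    // Typed access: throws if the name is unknown or T does not match
    // the type the column was created with (a runtime type check).
    template <typename T>
    const std::vector<T>& column(const std::string& name) const {
        auto it = columns_.find(name);
        if (it == columns_.end()) {
            throw std::out_of_range("no column named '" + name + "'");
        }
        const auto* values = std::any_cast<std::vector<T>>(&it->second);
        if (values == nullptr) {
            throw std::runtime_error("column '" + name + "' has a different type");
        }
        return *values;
    }

    std::size_t rows() const { return rows_; }

private:
    std::unordered_map<std::string, std::any> columns_;
    std::size_t rows_ = 0;
};
```

Because std::any accepts any copy-constructible type, the same pattern would also hold a std::vector of a user-defined struct, which is one plausible way to get custom-type columns without extra dependencies.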

Quick Start

Basic Usage

#include <iostream>
#include <string>

#include "dataframe/DataFrame.hpp"

using namespace dataframe;

int main() {
    // Create a DataFrame
    DataFrame df;
    
    // Add columns
    df.add_column<std::string>("name", {"Alice", "Bob", "Charlie"});
    df.add_column<int>("age", {25, 30, 35});
    df.add_column<double>("score", {88.5, 92.0, 78.5});
    
    // Filter data
    auto filtered = df.filter([](const Row& row) {
        return row.get<int>("age") > 25 && row.get<double>("score") >= 80.0;
    });
    
    // Iterate and print
    for (const auto& row : filtered) {
        std::cout << row.get<std::string>("name") << ": " 
                  << row.get<double>("score") << std::endl;
    }
    
    // Column statistics
    auto score_col = df.get_column<double>("score");
    std::cout << "Mean score: " << score_col.mean() << std::endl;
    
    return 0;
}

Output:

Bob: 92
Mean score: 86.3333

Core Operations

Creating DataFrames

// Empty DataFrame
DataFrame df;

// With column names
DataFrame df({"name", "age", "score"});

Adding Data

// Add columns
df.add_column<int>("age", {25, 30, 35});
df.add_column<std::string>("city", {"NYC", "LA", "Chicago"});

// Add rows
df.add_row({
    {"name", "Diana"},
    {"age", 28},
    {"score", 95.0}
});

Accessing Data

// Direct element access
int age = df.get<int>(0, "age");
df.set<int>(0, "age", 26);

// Column access
auto age_col = df.get_column<int>("age");
int first_age = age_col[0];

// Row iteration
for (const auto& row : df) {
    std::string name = row.get<std::string>("name");
    int age = row.get<int>("age");
}

Filtering

// Simple filter
auto adults = df.filter([](const Row& row) {
    return row.get<int>("age") >= 18;
});

// Complex conditions
auto result = df.filter([](const Row& row) {
    return row.get<int>("age") > 25 &&
           row.get<double>("score") >= 85.0 &&
           row.get<bool>("active");
});

// Chained filters
auto filtered = df.filter(condition1)
                  .filter(condition2)
                  .filter(condition3);
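
To make the filtering model concrete, here is a standalone sketch of a single-pass predicate filter over column-major data. The helpers matching_rows and gather are hypothetical names, not part of the library, and the library's own filter may work differently internally.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: one pass over the row indices, keeping those
// for which the predicate returns true. Cost is O(n) in the row count.
template <typename Pred>
std::vector<std::size_t> matching_rows(std::size_t row_count, Pred keep) {
    std::vector<std::size_t> rows;
    for (std::size_t i = 0; i < row_count; ++i) {
        if (keep(i)) {
            rows.push_back(i);
        }
    }
    return rows;
}

// Hypothetical helper: copy the selected rows of one column into a new
// column. at() throws std::out_of_range on an invalid row index.
template <typename T>
std::vector<T> gather(const std::vector<T>& column,
                      const std::vector<std::size_t>& rows) {
    std::vector<T> out;
    out.reserve(rows.size());
    for (std::size_t r : rows) {
        out.push_back(column.at(r));
    }
    return out;
}
```

With the Quick Start data (ages 25, 30, 35 and scores 88.5, 92.0, 78.5), the predicate age > 25 && score >= 80.0 keeps only row 1 (Bob). In this sketch, chaining several filters would copy the surviving rows once per step, so combining the conditions into one predicate avoids the intermediate copies.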

CSV I/O (v1.2)

#include "dataframe/io/CSV.hpp"

// Write DataFrame to CSV
DataFrame df;
df.add_column<std::string>("name", {"Alice", "Bob", "Charlie"});
df.add_column<int>("age", {25, 30, 35});
df.add_column<double>("score", {88.5, 92.0, 78.5});

CSVWriter::write(df, "output.csv");

// Read CSV with automatic type inference
DataFrame loaded = CSVReader::read("input.csv");

// Custom delimiter (e.g., tab-separated)
CSVOptions opts;
opts.delimiter = '\t';
CSVWriter::write(df, "data.tsv", opts);
DataFrame tsv_data = CSVReader::read("data.tsv", opts);

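Type Inference: The CSV reader detects each column's type automatically, trying int, then double, then bool, and falling back to string. Because bool values are written as 0/1, they are read back as int (see Limitations).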
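
The inference order can be sketched in standalone C++ as follows. This is an illustration under assumptions, not the library's reader: CellType, infer_cell, and infer_column are hypothetical names, and the accepted bool spellings ("true"/"false") are a guess.

```cpp
#include <charconv>
#include <cstdlib>
#include <string>
#include <system_error>
#include <vector>

enum class CellType { Int, Double, Bool, String };

// True if the whole cell parses as a base-10 int.
bool parses_as_int(const std::string& s) {
    int value = 0;
    const char* first = s.data();
    const char* last = s.data() + s.size();
    auto [ptr, ec] = std::from_chars(first, last, value);
    return !s.empty() && ec == std::errc() && ptr == last;
}

// True if strtod consumes the whole cell.
bool parses_as_double(const std::string& s) {
    if (s.empty()) return false;
    char* end = nullptr;
    std::strtod(s.c_str(), &end);
    return end == s.c_str() + s.size();
}

// Assumption: only these two spellings count as bool.
bool parses_as_bool(const std::string& s) {
    return s == "true" || s == "false";
}

// Try the narrowest type first: int, then double, then bool, then string.
CellType infer_cell(const std::string& s) {
    if (parses_as_int(s)) return CellType::Int;
    if (parses_as_double(s)) return CellType::Double;
    if (parses_as_bool(s)) return CellType::Bool;
    return CellType::String;
}

// A column gets the narrowest type that fits every one of its cells.
CellType infer_column(const std::vector<std::string>& cells) {
    if (cells.empty()) return CellType::String;
    bool all_int = true, all_numeric = true, all_bool = true;
    for (const auto& cell : cells) {
        const CellType t = infer_cell(cell);
        all_int = all_int && (t == CellType::Int);
        all_numeric = all_numeric && (t == CellType::Int || t == CellType::Double);
        all_bool = all_bool && (t == CellType::Bool);
    }
    if (all_int) return CellType::Int;
    if (all_numeric) return CellType::Double;
    if (all_bool) return CellType::Bool;
    return CellType::String;
}
```

Under this scheme a column containing only 0 and 1 is classified as int, which matches the CSV bool limitation listed under Limitations.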

Aggregation (v1.1)

// Group by and aggregate
auto summary = df.group_by("department")
                 .agg({
                     {"age", AggFunc::Mean},
                     {"salary", AggFunc::Sum},
                     {"salary", AggFunc::Mean},
                     {"name", AggFunc::Count}
                 });

// Access aggregated results
for (size_t i = 0; i < summary.rows(); ++i) {
    std::cout << summary.get<std::string>(i, "department") << ": "
              << "Total salary: " << summary.get<double>(i, "salary_sum") << ", "
              << "Avg salary: " << summary.get<double>(i, "salary_mean") << ", "
              << "Count: " << summary.get<double>(i, "name_count") << std::endl;
}

// Column statistics
auto col = df.get_column<double>("salary");
double mean = col.mean();
double total = col.sum();
double min_val = col.min();
double max_val = col.max();
double std_dev = col.stddev();  // avoid naming a variable "std"
double variance = col.variance();

// Available aggregation functions:
// AggFunc::Count, Sum, Mean, Min, Max, StdDev, Variance, First, Last
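
For readers curious how a single-pass group-by might work, here is a standalone sketch that accumulates count, sum, and mean per group. GroupStats and group_stats are hypothetical names; the library's group_by(...).agg(...) may be implemented differently.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-group accumulator, not a library type.
struct GroupStats {
    std::size_t count = 0;
    double sum = 0.0;

    // Mean of the values seen so far; 0.0 for an empty group.
    double mean() const {
        return count == 0 ? 0.0 : sum / static_cast<double>(count);
    }
};

// One pass over the rows: keys[i] names the group of values[i].
std::map<std::string, GroupStats> group_stats(const std::vector<std::string>& keys,
                                              const std::vector<double>& values) {
    assert(keys.size() == values.size());
    std::map<std::string, GroupStats> groups;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        GroupStats& g = groups[keys[i]];  // creates the group on first use
        ++g.count;
        g.sum += values[i];
    }
    return groups;
}
```

Count, sum, and mean fit in one pass with a single accumulator per group. Min and max could be tracked the same way, while standard deviation and variance would need either a second pass or a running algorithm such as Welford's.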

Column Operations

// Select columns
auto subset = df.select({"name", "age"});

// Drop columns
auto reduced = df.drop({"temp_column"});

// Rename columns
df.rename({{"old_name", "new_name"}});

// Remove column
df.remove_column("obsolete");

Working with Custom Types

The library seamlessly supports user-defined types:

struct SimulationResult {
    double energy;
    double efficiency;
    std::string status;
};

DataFrame sim_data;
sim_data.add_column<int>("iteration", {1, 2, 3});
sim_data.add_column<SimulationResult>("result", {
    {100.5, 0.85, "converged"},
    {102.3, 0.87, "converged"},
    {98.7, 0.82, "converged"}
});

// Filter based on custom type fields
auto high_efficiency = sim_data.filter([](const Row& row) {
    auto result = row.get<SimulationResult>("result");
    return result.efficiency > 0.85;
});

Performance Characteristics

Operation        Complexity        Notes
Column access    O(1)              Direct map lookup
Element access   O(1)              Vector indexing
Filter           O(n)              Single pass over rows
Add row          O(m)              m = number of columns
Add column       O(1) amortized    Vector push_back
Aggregation      O(n)              Single pass per aggregation

Typical Performance:

  • 1,000 rows: Instant operations
  • 10,000 rows: Sub-millisecond operations
  • 100,000 rows: Millisecond-range operations
  • 1,000,000 rows: <100ms for most operations
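
These figures depend on hardware, compiler, and optimization flags, so it is worth measuring on your own machine. The sketch below shows one way to time a single-pass operation with std::chrono. time_ms and count_above are hypothetical helpers, and the workload is a plain std::vector<double> standing in for a column, not the library's own column type.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: run `work` once and return the elapsed wall-clock
// time in milliseconds, measured with a monotonic clock.
template <typename F>
double time_ms(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Example workload: count the values above a threshold in one O(n) pass.
std::size_t count_above(const std::vector<double>& column, double threshold) {
    std::size_t n = 0;
    for (double v : column) {
        if (v > threshold) {
            ++n;
        }
    }
    return n;
}
```

A typical use is to fill a vector with 1,000,000 values, wrap the call to count_above in a lambda, and pass it to time_ms. Use an optimized build (for example -O2) and repeat the measurement a few times, since the first run often includes cache warm-up.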

Building

Requirements

  • C++17 compatible compiler (GCC 7+, Clang 5+, MSVC 2017+)
  • CMake 3.14 or later
  • Standard C++ library

Build Instructions

# Clone the repository
git clone https://github.com/iglasser/DataFrame.git
cd DataFrame

# Create build directory
mkdir build && cd build

# Configure and build
cmake ..
make

# Run tests
make test

# Install (optional)
sudo make install

Using in Your Project

With CMake

# Find the package
find_package(DataFrame REQUIRED)

# Link to your target
add_executable(your_app main.cpp)
target_link_libraries(your_app DataFrame::DataFrame)

Header-Only Alternative

The library can also be used as header-only by including the headers directly:

#include "dataframe/DataFrame.hpp"

Documentation

Examples

The examples/ directory contains demonstration programs:

  • basic_usage.cpp - Creating DataFrames, adding data, basic operations
  • example_advanced.cpp - Advanced features including aggregation and statistics
  • example_io.cpp - CSV I/O operations and data persistence

Build and run examples:

# Using Makefile
make basic_usage
./build/basic_usage

make advanced_usage
./build/advanced_usage

make io_example
./build/io_example

Design Principles

  1. Simplicity First - API should be intuitive and easy to learn
  2. Type Safety - Leverage C++ type system while allowing flexibility
  3. Performance Balance - Good performance without premature optimization
  4. Zero Dependencies - Only C++ standard library
  5. User-Friendly Errors - Clear error messages with context

Limitations

  • Not thread-safe - Use external synchronization for concurrent access
  • No null values - Current version doesn't support missing data
  • No joins - Table joining not implemented yet
  • CSV bool limitation - Bool values are stored as 0/1 and inferred as int when reading
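
For the thread-safety limitation, "external synchronization" can be as simple as guarding every access with one mutex. The sketch below shows that pattern in standalone form. Synchronized and with_lock are hypothetical names, and the wrapper is generic, so it is shown here without assuming anything about the DataFrame class itself.

```cpp
#include <mutex>
#include <utility>

// Hypothetical wrapper: owns a table and a mutex, and only exposes the
// table to callers while the mutex is held.
template <typename Table>
class Synchronized {
public:
    explicit Synchronized(Table table) : table_(std::move(table)) {}

    // Run `fn` with exclusive access to the wrapped table and return
    // whatever `fn` returns. The lock is released when this call exits.
    template <typename F>
    auto with_lock(F&& fn) {
        std::lock_guard<std::mutex> guard(mutex_);
        return fn(table_);
    }

private:
    std::mutex mutex_;
    Table table_;
};
```

Each thread would then do all of its reads and writes inside with_lock. Avoid returning references or iterators into the table from the callback, because they would outlive the lock.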

Roadmap

Version 1.0 ✅

  • ✅ Core DataFrame operations
  • ✅ Filtering and iteration
  • ✅ Column statistics
  • ✅ Custom type support

Version 1.1 ✅

  • ✅ GroupBy operations
  • ✅ Aggregation functions (9 total)
  • ✅ Multiple aggregations per column
  • ✅ Advanced examples

Version 1.2 (Current) ✅

  • ✅ CSV import/export with automatic type inference
  • ✅ Dynamic row addition (add_row)
  • ✅ Custom CSV options (delimiter, header)

Version 1.3 (Planned)

  • Sorting operations (sort_by)
  • Join operations (merge DataFrames)
  • Null/missing value support
  • Parquet format support

Version 2.0 (Future)

  • Parallel filtering and aggregation
  • Lazy evaluation
  • Window functions
  • SIMD optimizations

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Areas where we'd especially welcome contributions:

  • Additional aggregation functions
  • Performance optimizations
  • More comprehensive tests
  • Example use cases
  • Documentation improvements

License

This project is licensed under the MIT License - see the LICENSE file for details.

Authors

Acknowledgments

  • Inspired by pandas (Python) and polars (Rust)
  • Designed for the simulation modeling community
  • Thanks to all contributors and testers

Support


Ready to get started? Check out the Quick Start section above or explore the examples directory!
