-
-
Notifications
You must be signed in to change notification settings - Fork 19.1k
Description
Feature Type
-
Adding new functionality to pandas
-
Changing existing functionality in pandas
-
Removing existing functionality in pandas
Problem Description
I wish I could use pandas to track changes and maintain version history of DataFrames during data analysis workflows with Git-like semantics.
Currently, when working with DataFrames in data science projects, there's no built-in way to:
- Save snapshots of DataFrame states before risky transformations
- Roll back to previous versions when experiments fail
- Create experimental branches for testing different feature engineering approaches
- Track what changes were made with commit messages and timestamps
- Compare different versions to see exactly what changed between transformations
- Export and import complete analysis history for reproducibility
This leads to verbose manual backup code (df_backup = df.copy()
), lost work when transformations go wrong, and difficulty collaborating on data analysis where team members need to understand the evolution of datasets.
Feature Description
I propose adding a DataFrame versioning extension that provides Git-like version control capabilities directly integrated into pandas DataFrames through the accessor pattern.
Core API using pandas accessor:
import pandas as pd
# Create DataFrame and enable versioning
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'salary': [50000, 60000, 70000]
})
# Enable versioning with optional delta storage for efficiency
df.enable_versioning(use_delta_storage=True)
# Make changes and commit with messages
df['salary'] = df['salary'] * 1.1
df.commit("Increased salaries by 10%")
# Oops, that was wrong - rollback!
df.rollback() # DataFrame reverts to previous state
# View complete history with timestamps and messages
print(df.version.history())
Advanced branching and merging:
# Create experimental branch
df.version.create_branch("experiment")
df.version.switch_branch("experiment")
# Make experimental changes
df['bonus'] = df['salary'] * 0.1
df.commit("Added experimental bonus calculation")
# Switch back to main and merge if satisfied
df.version.switch_branch("main")
df.version.merge_branch("experiment")
Diff visualization and comparison:
# See what changed between versions
diff = df.version.diff_with(version=0)
diff.show() # Formatted console output
diff.to_html("analysis_changes.html") # Generate HTML report
# Get summary of changes
print(diff.summary())
Export/Import for reproducibility:
# Export complete history for backup/sharing
df.version.export_history("analysis_v1.pkl")
df.version.export_history("analysis_v1.json") # Multiple formats
df.version.export_history("analysis_v1.pkl.gz") # With compression
# Import history later or on different machine
df.version.import_history("analysis_v1.pkl")
# Check storage efficiency
stats = df.version.get_storage_stats()
print(f"Space saved with delta storage: {stats['space_saved_percentage']:.1f}%")
Implementation details:
- Uses
@register_dataframe_accessor("version")
for clean integration - Efficient delta storage system (saves 20-80% memory vs full snapshots)
- Support for both accessor style (
df.version.commit()
) and direct methods (df.commit()
) - Thread-safe version management with configurable limits
- Multiple serialization formats (pickle, JSON, dill) with compression support
- Comprehensive error handling and validation
Alternative Solutions
Current workarounds:
-
Manual DataFrame copying:
df_v1 = df.copy() # Memory intensive, no metadata df_v2 = df.copy() # Becomes unmanageable quickly
-
External file-based versioning:
df.to_pickle(f'data_v{version}.pkl') # Manual, error-prone
-
Custom tracking dictionaries:
versions = {'v1': df.copy(), 'v2': df.copy()} # Memory explosion
3rd party packages:
- DVC (Data Version Control): File-based, requires external commands, not integrated with DataFrame operations
- MLflow: Focused on ML experiments, not interactive data analysis workflows
- Pachyderm: Heavyweight, container-based, overkill for DataFrame versioning
- Git LFS: Requires external files, complex setup, no DataFrame-native operations
None provide the seamless, DataFrame-native experience with efficient memory usage that this extension would offer.
Additional Context
Complete working implementation available:
I have developed a fully functional prototype demonstrating this capability with the following features:
- Comprehensive test suite: 67+ tests covering all functionality
- Performance optimizations: Delta storage reduces memory usage by 20-80%
- Production-ready code: Error handling, type hints, documentation
- Multiple storage backends: Full snapshots or efficient delta-only storage
- Export formats: Pickle, JSON, Dill with optional compression
- Professional documentation: Complete API reference and examples
Repository structure:
pandas-versioning/
├── pandas_versioning/
│ ├── core/
│ │ ├── pandas_extension.py # Main accessor implementation
│ │ ├── version_manager.py # Core versioning logic
│ │ ├── delta_storage.py # Efficient delta storage
│ │ ├── diff_engine.py # Change visualization
│ │ └── serializer.py # Export/import functionality
├── tests/ # Comprehensive test suite
├── examples/ # Usage examples
└── docs/ # Documentation
Performance benchmarks (from testing):
- Delta storage: 65% memory reduction on typical datasets
- Commit operations: <50ms for DataFrames up to 100K rows
- Branch operations: O(1) time complexity
- Export/import: Supports datasets up to several GB
Real-world use cases proven:
- Data Science Workflows: Safe experimentation with feature engineering
- Collaborative Analysis: Team members can track and understand changes
- Production Pipelines: Automated rollback when data validation fails
- Educational: Clear demonstration of data transformation steps
Integration approach:
The implementation uses pandas' official extension mechanism and follows all pandas conventions:
- Follows pandas accessor patterns
- Maintains DataFrame immutability where appropriate
- Compatible with pandas 1.3+ and Python 3.8+
- No breaking changes to existing pandas functionality
- Clean, documented API following pandas style
This feature would significantly enhance pandas' capabilities for interactive data analysis while maintaining full backward compatibility and following established pandas design patterns.