Transform machine learning pipelines from code into conversation.
PipelineScript is a revolutionary Domain-Specific Language (DSL) that makes machine learning pipelines readable, debuggable, and accessible to everyone. No more nested code, complex APIs, or cryptic configurations.
The usual way, in raw Python:

```python
import pandas as pd
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Load data
data = pd.read_csv('data.csv')

# Clean
data = data.dropna()

# Encode categoricals
for col in data.select_dtypes(['object']).columns:
    data[col] = LabelEncoder().fit_transform(data[col])

# Split
X = data.drop('target', axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train
model = XGBClassifier()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

# Export
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
```

The same pipeline in PipelineScript:

```
load data.csv
clean missing
encode
split 80/20 --target target
scale
train xgboost
evaluate
export model.pkl
```
That's it. Same functionality, 90% less code, infinitely more readable.
Write ML pipelines like you'd describe them to a colleague:
```
load sales.csv
filter revenue > 1000
clean outliers
split 75/25 --target revenue
train xgboost
evaluate
```
Step through your pipeline like a regular program:
```python
from pipelinescript import debug

debug("""
load data.csv
clean missing
train xgboost
""")
```

Debugger commands:

- `step`: Execute next step
- `break 3`: Set breakpoint at step 3
- `context`: Show current data and model
- `inspect model`: Inspect a specific variable
- `continue`: Run until completion
Automatically visualize your pipeline structure:
```python
from pipelinescript import run

run(script, visualize=True)
```

Generates ASCII or graphical pipeline diagrams showing data flow.
Prefer Python? Use the fluent API:
```python
from pipelinescript import Pipeline

result = (Pipeline()
    .load("data.csv")
    .clean_missing()
    .encode()
    .split(0.8, target="label")
    .train("xgboost")
    .evaluate()
    .export("model.pkl")
    .run())
```

Pre-built pipelines for common tasks:
```python
from pipelinescript.pipeline import quick_classification

# One line for a complete classification pipeline
result = quick_classification("data.csv", "label", "xgboost")
```

Install with pip:

```
pip install pipelinescript
```

Optional dependencies:

```
# For XGBoost models
pip install xgboost

# For visualization
pip install matplotlib

# For all features
pip install pipelinescript[full]
```

Create `my_pipeline.psl`:
```
load iris.csv
clean missing
encode
split 80/20 --target species
train random_forest
evaluate
export iris_model.pkl
```
Command line:

```
pipelinescript run my_pipeline.psl
```

Python:

```python
from pipelinescript import run

result = run("my_pipeline.psl")
if result.success:
    print(f"✅ Accuracy: {result.context.metrics['accuracy']:.4f}")
```

That's it! Your model is trained, evaluated, and exported.
```
load <filepath>    # Load data from file
```

Supported formats: CSV, Excel, JSON, Parquet.
```
clean missing             # Remove rows with missing values
clean duplicates          # Remove duplicate rows
clean outliers            # Remove statistical outliers (IQR method)
encode                    # Encode categorical variables
scale                     # Scale numeric features (StandardScaler)
filter <condition>        # Filter rows (e.g., "age > 18")
select <col1> <col2> ...  # Select specific columns
```
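The `clean outliers` command names the IQR method; as a rough sketch (not the library's actual implementation), the filter amounts to dropping rows that fall outside 1.5 interquartile ranges of the quartiles in any numeric column:

```python
import pandas as pd

def remove_outliers_iqr(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Drop rows where any numeric column falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = df.select_dtypes("number")
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    # Keep only rows inside the fences in every numeric column
    inliers = ~((numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)).any(axis=1)
    return df[inliers]

df = pd.DataFrame({"x": [1, 2, 3, 4, 100]})
print(len(remove_outliers_iqr(df)))  # 4 -- the extreme value 100 is dropped
```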
```
split 80/20                 # Split data 80% train, 20% test
split 0.8 --target label    # Split with specific target column
split 75/25 --target price  # Custom ratio with target
```
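A ratio like `80/20` presumably reduces to a train fraction internally; a hypothetical helper (`parse_split` is illustrative, not part of the library) could look like:

```python
def parse_split(spec: str) -> float:
    """Turn a ratio like '80/20' or a fraction like '0.8' into a train fraction."""
    if "/" in spec:
        train, test = (float(p) for p in spec.split("/"))
        return train / (train + test)
    return float(spec)

print(parse_split("80/20"))  # 0.8
print(parse_split("0.8"))    # 0.8
```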
```
train xgboost        # XGBoost (requires xgboost package)
train random_forest  # Random Forest
train logistic       # Logistic Regression
train linear         # Linear Regression
train auto           # Auto-select based on task
```
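`train auto` selects a model based on the task; one plausible heuristic (an assumption for illustration, not the library's actual rule) is to treat non-numeric or low-cardinality targets as classification:

```python
import pandas as pd

def infer_task(y: pd.Series, max_classes: int = 20) -> str:
    """Guess classification vs regression from the target column."""
    if y.dtype == object or y.nunique() <= max_classes:
        return "classification"
    return "regression"

print(infer_task(pd.Series(["setosa", "versicolor", "setosa"])))  # classification
print(infer_task(pd.Series(range(1000)) * 0.37))                  # regression
```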
```
predict   # Make predictions on test set
evaluate  # Compute evaluation metrics
```
```
export model.pkl  # Save model to file
save model.pkl    # Alias for export
import model.pkl  # Load model from file
```
Options use `--flag` or `-f` syntax:

```
split 80/20 --target revenue
train xgboost --n_estimators 100
```
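Splitting such a line into command, positional arguments, and options is straightforward; a minimal sketch (hypothetical, not the library's actual parser):

```python
def parse_command(line: str):
    """Split a script line into (command, args, options)."""
    tokens = line.split()
    command, args, options = tokens[0], [], {}
    i = 1
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("-") and not tok[1:].replace(".", "").isdigit():
            # '--flag value' or '-f value' pairs; negative numbers stay positional
            options[tok.lstrip("-")] = tokens[i + 1]
            i += 2
        else:
            args.append(tok)
            i += 1
    return command, args, options

print(parse_command("split 80/20 --target revenue"))
# ('split', ['80/20'], {'target': 'revenue'})
```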
Use `#` for comments:

```
# Load and prepare data
load data.csv
clean missing  # Remove nulls

# Train model
train xgboost
```
Titanic survival classification:

```
load titanic.csv
clean missing
encode
split 80/20 --target survived
train random_forest
evaluate
export titanic_model.pkl
```

Housing price regression:

```
load housing.csv
clean outliers
select bedrooms bathrooms sqft price
scale
split 75/25 --target price
train linear
evaluate
```

Sales analysis:

```
load sales.csv
filter revenue > 1000
select date product revenue region
clean missing
encode
split 80/20 --target revenue
train xgboost
evaluate
export sales_model.pkl
```
```python
from pipelinescript import debug

script = """
load data.csv
clean missing
split 80/20 --target label
train xgboost
evaluate
"""

result = debug(script)

# In the debugger:
# (pdb) step           # Execute next step
# (pdb) context        # Show current state
# (pdb) inspect model  # Look at the model
# (pdb) continue       # Run to completion
```

Method chaining:

```python
from pipelinescript import Pipeline

pipeline = (Pipeline()
    .load("data.csv")
    .clean_missing()
    .clean_outliers()
    .encode()
    .scale()
    .split(0.8, target="label")
    .train_xgboost()
    .evaluate()
    .export("model.pkl")
)

# Execute
result = pipeline.run()

# Show results
if result.success:
    print(f"Duration: {result.duration:.2f}s")
    print(f"Metrics: {result.context.metrics}")
```

One-liners:

```python
from pipelinescript.pipeline import (
    quick_classification,
    quick_regression,
    quick_train,
)

# Classification in one line
result = quick_classification("iris.csv", "species", "xgboost")

# Regression in one line
result = quick_regression("housing.csv", "price", "random_forest")

# Train and export in one line
result = quick_train("data.csv", "target", "model.pkl")
```

```python
from pipelinescript import run

run(script, visualize=True)
```

Output:
```
════════════════════════════════════════════════
📊 PIPELINE VISUALIZATION
════════════════════════════════════════════════

START
  │
  ▼
┌───────────────┐
│ LOAD data.csv │
└───────────────┘
  │
  ▼
┌───────────────┐
│ CLEAN missing │
└───────────────┘
  │
  ▼
┌───────────────┐
│ TRAIN xgboost │
└───────────────┘
  │
  ▼
END
```
```python
from pipelinescript import parse
from pipelinescript.visualizer import PipelineVisualizer

ast = parse(script)
visualizer = PipelineVisualizer()
visualizer.visualize_pipeline(ast, save_path="pipeline.png")
```

Generates a flowchart visualization of the pipeline.
PipelineScript includes a powerful interactive debugger inspired by Python's pdb:
```python
from pipelinescript import debug

debug("""
load data.csv
clean missing
split 80/20 --target label
train xgboost
evaluate
""")
```

| Command | Alias | Description |
|---|---|---|
| `run` | `r` | Run until completion/breakpoint |
| `step` | `s`, `next`, `n` | Execute next step |
| `continue` | `c`, `cont` | Continue execution |
| `break <n>` | `b` | Set breakpoint at step n |
| `clear <n>` | | Clear breakpoint |
| `list` | `l`, `ls` | List all steps |
| `context` | `ctx`, `vars` | Show execution context |
| `inspect <var>` | `i`, `p` | Inspect variable |
| `restart` | | Restart from beginning |
| `quit` | `q`, `exit` | Quit debugger |
```
(pdb) list
Pipeline Steps:
══════════════════════════════════════════════
→ 1. load
  2. clean
  3. split
  4. train
  5. evaluate
══════════════════════════════════════════════

(pdb) break 4
🔴 Breakpoint set at step 4

(pdb) run
▶️ Step 1: load
   Loaded 150 rows from iris.csv
▶️ Step 2: clean
   Removed 0 rows with missing values
▶️ Step 3: split
   Split data: 120 train, 30 test (80/20)
🔴 Breakpoint at step 4

(pdb) context
📊 Execution Context:
══════════════════════════════════════════════
data: DataFrame (150, 5)
  columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
X_train: (120, 4)
X_test: (30, 4)

Recent log entries:
  • Loaded 150 rows from iris.csv
  • Removed 0 rows with missing values
  • Split data: 120 train, 30 test (80/20)
══════════════════════════════════════════════

(pdb) step
▶️ Step 4: train
   Trained XGBClassifier

(pdb) inspect model
model: XGBClassifier
Value: XGBClassifier(...)

(pdb) continue
▶️ Step 5: evaluate
   Accuracy: 0.9667
✅ Pipeline execution completed!
```
PipelineScript consists of five core components:
```
┌─────────────────────────────────────────────┐
│            PipelineScript Engine            │
├─────────────────────────────────────────────┤
│                                             │
│  1. Parser     → Lexical analysis & AST     │
│  2. Compiler   → AST to executable steps    │
│  3. Executor   → Step execution engine      │
│  4. Debugger   → Interactive debugging      │
│  5. Visualizer → Pipeline visualization     │
│                                             │
└─────────────────────────────────────────────┘
```
**Parser**
- Lexical analysis (tokenization)
- Syntax parsing
- AST generation

**Compiler**
- Compiles the AST into executable steps
- Integrates with scikit-learn and XGBoost
- Handles data transformations

**Executor**
- Executes compiled steps
- Manages the execution context
- Handles errors and logging

**Debugger**
- Interactive step-through execution
- Breakpoints and inspection
- Context visualization

**Visualizer**
- ASCII pipeline diagrams
- Graphical visualizations
- DAG export
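The executor's job reduces to threading one context object through a list of step functions; a toy sketch of that pattern (the names here are illustrative, not the library's actual classes):

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    data: object = None
    log: list = field(default_factory=list)

def load_step(ctx: Context) -> Context:
    ctx.log.append("load data.csv")
    return ctx

def clean_step(ctx: Context) -> Context:
    ctx.log.append("clean missing")
    return ctx

def execute(steps, ctx: Context) -> Context:
    # Each compiled step takes the context and returns the (updated) context
    for step in steps:
        ctx = step(ctx)
    return ctx

ctx = execute([load_step, clean_step], Context())
print(ctx.log)  # ['load data.csv', 'clean missing']
```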
Test different models and preprocessing strategies in minutes:
```
load data.csv
clean missing
split 80/20 --target label
train xgboost
evaluate
```
Perfect for teaching ML concepts without drowning in code:
```
# Clear, readable steps students can understand
load iris.csv
split 70/30 --target species
train random_forest
evaluate
```
Pipeline scripts are version-controllable and self-documenting:
```
# research_pipeline.psl
load experiment_data.csv
clean outliers
split 80/20 --target outcome
train xgboost
evaluate
```
Easily generate and test multiple pipelines programmatically:
```python
models = ['xgboost', 'random_forest', 'logistic']

for model in models:
    pipeline = Pipeline().load("data.csv").clean_missing()
    pipeline.split(0.8, target="label").train(model).evaluate()
    result = pipeline.run()
    print(f"{model}: {result.context.metrics['accuracy']}")
```

Export trained pipelines as standalone Python scripts or containers.
```python
from pipelinescript import Pipeline

pipeline = Pipeline()
pipeline.load("data.csv")

# Custom filtering
pipeline.filter("age > 18 and income < 100000")

# Select features
pipeline.select("age", "income", "education")

# Continue the pipeline
pipeline.clean_missing().encode().scale()
pipeline.split(0.8, target="default").train("xgboost")

result = pipeline.run()
```

Inspect the results:

```python
result = pipeline.run()

if result.success:
    # Access data
    print(result.context.data.head())

    # Access model
    model = result.context.model

    # Access metrics
    print(result.context.metrics)

    # Access predictions
    predictions = result.context.predictions

    # Access log
    for entry in result.context.log:
        print(entry)
```

Add custom commands by extending the compiler:
```python
from pipelinescript.compiler import CompiledStep, PipelineCompiler
from pipelinescript.parser import ASTNode

class CustomCompiler(PipelineCompiler):
    def __init__(self):
        super().__init__()
        self.commands['my_command'] = self._compile_my_command

    def _compile_my_command(self, node: ASTNode):
        def custom_step(context):
            # Your custom logic
            return context
        return CompiledStep('my_command', custom_step, [], {}, node.line)
```

- v0.2.0: GPU support (RAPIDS, cuML)
- v0.3.0: Deep learning models (PyTorch, TensorFlow)
- v0.4.0: AutoML integration
- v0.5.0: Distributed training (Ray, Dask)
- v0.6.0: Model serving integration
- v0.7.0: Pipeline scheduling and monitoring
- v1.0.0: Production-ready feature complete
Contributions welcome! Areas needing help:
- Additional model types (SVM, KNN, etc.)
- More preprocessing options
- Better visualizations
- Documentation improvements
- Test coverage
See CONTRIBUTING.md for guidelines.
MIT License - see LICENSE file.
PipelineScript was inspired by:
- SQL's declarative simplicity
- UNIX pipes' composability
- scikit-learn's consistent API
- The need for ML democratization
| Feature | PipelineScript | Sklearn | Keras | MLflow |
|---|---|---|---|---|
| Human-readable syntax | ✅ | ❌ | ❌ | ❌ |
| Interactive debugging | ✅ | ❌ | ❌ | ❌ |
| Built-in visualization | ✅ | ❌ | ✅ | ✅ |
| One-line pipelines | ✅ | ❌ | ❌ | ❌ |
| No code required | ✅ | ❌ | ❌ | ❌ |
| Production ready | 🚧 | ✅ | ✅ | ✅ |
See the examples/ directory for:
- `simple_classification.psl` - Basic classification
- `xgboost_pipeline.psl` - XGBoost example
- `regression.psl` - Regression pipeline
- `python_examples.py` - Python API examples
- `iris.csv` - Sample dataset
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: idrissbadoolivier@gmail.com
If you find PipelineScript useful, please star the repo! ⭐
🔥 Built with ❤️ by Idriss Bado
Making machine learning pipelines human again.