A production-grade machine learning pipeline for single-cell RNA-seq data analysis with clone-level classification.
This project uses uv for fast Python package management.
- Python 3.11 or later
- uv installed
Device Support: TRACE automatically detects and utilizes the best available device for XGBoost training:
- CUDA/GPU: Automatically used when available (highest priority)
- Apple Silicon: Optimized for Apple Silicon Macs (CPU training for optimal compatibility)
- CPU: Fallback for all systems
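
The priority order above can be sketched as a small selection function. This is an illustrative sketch only, not TRACE's actual detection code: the function name `pick_training_device` and the `cuda_available` flag are assumptions, and how the result is passed on to XGBoost depends on the installed XGBoost version.

```python
import platform
from typing import Optional


def pick_training_device(
    cuda_available: bool,
    system: Optional[str] = None,
    machine: Optional[str] = None,
) -> str:
    """Return "cuda" or "cpu" following the priority order listed above.

    Hypothetical helper: `cuda_available` must be supplied by the caller.
    """
    system = system if system is not None else platform.system()
    machine = machine if machine is not None else platform.machine()

    if cuda_available:
        return "cuda"  # highest priority
    if system == "Darwin" and machine == "arm64":
        return "cpu"  # Apple Silicon: CPU training for compatibility
    return "cpu"  # fallback for all other systems


print(pick_training_device(cuda_available=False))
```

Keeping the platform values as optional parameters makes the decision easy to test without access to the corresponding hardware.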
# Install dependencies
uv sync
# Activate the virtual environment
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Verify installation
trace --help
TRACE requires the following production dependencies:
- click>=8.0.0 - CLI framework
- pyyaml>=6.0 - Configuration file parsing
- pydantic>=2.0.0 - Data validation
- pydantic-settings>=2.0.0 - Settings management
- pandas>=2.0.0 - Data manipulation
- numpy>=1.20.0 - Numerical computing
- scikit-learn>=1.6.0,<2.0 - Machine learning utilities
- xgboost>=1.7.0 - Gradient boosting framework
- scanpy>=1.9.0 - Single-cell analysis
- anndata>=0.9.0 - Annotated data structures
- scipy>=1.10.0 - Scientific computing
- matplotlib>=3.7.0 - Plotting
- seaborn>=0.12.0 - Statistical visualization
- scikit-misc>=0.2.0 - Additional scikit-learn utilities
- psutil>=5.9.0 - System utilities
- optuna>=3.5.0 - Hyperparameter optimization
- boto3>=1.26.0 - AWS SDK
- mlflow>=2.10.0 - ML experiment tracking
- psycopg2-binary>=2.9.0 - PostgreSQL adapter
- python-dotenv>=1.0.0 - Environment variable management
- requests>=2.31.0 - HTTP library
- shap>=0.43.0 - SHAP values for model interpretability
TRACE provides a comprehensive CLI for all machine learning operations. Here are the essential commands organized by workflow phase:
TRACE supports two distinct preprocessing workflows depending on your analysis needs:
For analyses that require individual cell-level data without clone summarization:
# Preprocess at single-cell level (no clone summarization)
trace preprocess data.h5ad --skip-clone-summarization
# Custom single-cell preprocessing
trace preprocess data.h5ad \
--skip-clone-summarization \
--gene-selection-method all \
--n-top-genes 500 \
--min-genes-per-cell 100 \
--cell-id-column barcode \
--label-column trt_label \
--output-file single_cell_data.h5
For analyses that require clone-level expression summaries:
# Preprocess with clone summarization (default)
trace preprocess data.h5ad
# Custom clone-level preprocessing
trace preprocess data.h5ad \
--gene-selection-method highly_variable \
--n-top-genes 500 \
--summary-method percentile \
--percentile 75.0 \
--min-cells-per-clone 1 \
--clone-id-column clone_ID \
--label-column trt_label \
--output-file clone_data.h5
TRACE provides several pre-configured files for different analysis types:
# Use pre-configured preprocessing files
trace --config config/config_preprocess_sc.yaml preprocess data.h5ad # Single-cell
trace --config config/config_preprocess_clone.yaml preprocess data.h5ad # Clone-level
# Use pre-configured training file
trace --config config/config_train_evaluate.yaml train processed_data.h5
Single-Cell Configuration (config/config_preprocess_sc.yaml):
preprocessing:
  gene_selection_method: all
  min_cells_per_clone: 1
  min_genes_per_cell: 100
  n_top_genes: 500
  percentile: 75.0
  quality_threshold: 1 # No filtering
  summary_method: percentile
  clone_id_column: clone_ID
  fallback_to_default_clone_ids: false
  label_column: trt_label
  fallback_to_default_labels: true
  skip_clone_summarization: true # Key difference
Clone-Level Configuration (config/config_preprocess_clone.yaml):
preprocessing:
  gene_selection_method: all
  min_cells_per_clone: 1
  min_genes_per_cell: 100
  n_top_genes: 500
  percentile: 75.0
  quality_threshold: 1 # No filtering
  summary_method: percentile
  clone_id_column: clone_ID
  fallback_to_default_clone_ids: false
  label_column: trt_label
  fallback_to_default_labels: true
  skip_clone_summarization: false # Key difference
# Generate comprehensive quality assessment report
trace preprocess data.h5ad --quality-report
# Use configuration file with quality reporting
trace --config config/config_preprocess_sc.yaml preprocess data.h5ad --quality-report
# Create default configuration template
trace init-config --output-file my_config.yaml
# Create batch processing configuration template
trace init-config --config-type batch --output-file batch_config.yaml
# Validate configuration file
trace validate-config my_config.yaml
# Use configuration file
trace --config config/tracepy_default.yaml preprocess data.h5ad
Available Configuration Files:
- config/config_preprocess_sc.yaml - Single-cell preprocessing configuration
- config/config_preprocess_clone.yaml - Clone-level preprocessing configuration
- config/config_train_evaluate.yaml - Default training and evaluation configuration
# Train XGBoost model with default settings
trace train processed_data.h5
# Train specific algorithm
trace train processed_data.h5 --algorithm randomforest
# Train with custom hyperparameters
trace train processed_data.h5 \
--algorithm xgboost \
--n-estimators 200 \
--max-depth 8 \
--learning-rate 0.05 \
--cv-folds 10
# Enable hyperparameter tuning with random search
trace train processed_data.h5 \
--hyperparameter-tuning \
--n-iter 50
# Use Optuna Bayesian optimization
trace train processed_data.h5 \
--hyperparameter-tuning \
--tuning-method optuna \
--optuna-trials 100 \
--optuna-sampler tpe
# Grid search
trace train processed_data.h5 \
--hyperparameter-tuning \
--tuning-method grid_search
# Evaluate trained model
trace evaluate model_xgboost_42.joblib processed_data.h5
# Generate evaluation plots
trace evaluate model_xgboost_42.joblib processed_data.h5 --generate-plots
# Custom output report
trace evaluate model_xgboost_42.joblib processed_data.h5 \
--output-report evaluation_report.json
# Compare models in directory
trace compare ./models/
# Compare with specific data
trace compare ./models/ --data-file processed_data.h5
# Use specific primary metric
trace compare ./models/ --primary-metric accuracy
# Custom comparison report
trace compare ./models/ --output-comparison comparison_report.json
# Extract metrics from parameter sweep directory (nested structure)
trace extract-metrics output_121925
# Extract metrics from single training run directory
trace extract-metrics output/eb10/training/seed_101_optuna_percentile_25_topngenes_50
# Specify custom output CSV file
trace extract-metrics output_121925 --output-file metrics_consolidated.csv
What extract-metrics Does:
- Automatically detects directory structure (parameter sweep vs single training run)
- Scans for test_results_*.json files in training directories
- Extracts all model performance metrics (accuracy, precision, recall, F1, ROC-AUC, PR-AUC, etc.)
- Parses metadata from folder names (seed, method, percentile, topngenes, run_id)
- Generates a consolidated CSV file with all metrics and metadata
Supported Directory Structures:
Parameter Sweep Structure:
output_dir/
├── run_type1/ (e.g., eb10bal, eb20bal, lognorm)
│   └── training/
│       └── seed_{seed}_{method}_percentile_{percentile}_topngenes_{topngenes}/
│           └── test_results_*.json
├── run_type2/
│   └── training/
│       └── ...
Single Training Run Structure:
output_dir/
├── test_results_*.json
OR
output_dir/
├── training/
│   └── test_results_*.json
OR
output_dir/
└── seed_{seed}_{method}_percentile_{percentile}_topngenes_{topngenes}/
    └── test_results_*.json
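
The folder-name convention shown in these trees can be parsed with a single regular expression. The following is a minimal sketch, assuming the seed and topngenes fields are integers and the percentile is numeric; the helper name `parse_run_dir` is hypothetical and this is not necessarily how extract-metrics implements the parsing.

```python
import re
from typing import Optional

# Matches: seed_{seed}_{method}_percentile_{percentile}_topngenes_{topngenes}
# The method part is matched lazily so that names containing underscores
# (for example grid_search) are still captured whole.
RUN_DIR_PATTERN = re.compile(
    r"^seed_(?P<seed>\d+)_(?P<method>.+?)"
    r"_percentile_(?P<percentile>\d+(?:\.\d+)?)"
    r"_topngenes_(?P<topngenes>\d+)$"
)


def parse_run_dir(name: str) -> Optional[dict]:
    """Return the metadata encoded in a run folder name, or None if it does not match."""
    match = RUN_DIR_PATTERN.match(name)
    if match is None:
        return None
    return {
        "seed": int(match["seed"]),
        "method": match["method"],
        "percentile": float(match["percentile"]),
        "topngenes": int(match["topngenes"]),
    }


print(parse_run_dir("seed_101_optuna_percentile_25_topngenes_50"))
```

Returning `None` for non-matching names lets a directory scan skip unrelated folders such as `training/` without raising.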
Output CSV Columns:
- Metadata: model_id, run_id, seed, method, percentile, topngenes, algorithm, optimal_threshold, n_selected_genes
- Test metrics: All metrics from test_metrics (prefixed with test_)
- Test data info: All fields from test_data_info (prefixed with test_data_)
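
The column-prefixing scheme can be illustrated with a short sketch. It assumes each results file is a JSON object with top-level `test_metrics` and `test_data_info` mappings; the helper names and the example field `n_samples` are hypothetical, and the values are placeholders rather than real results.

```python
import csv
import io


def flatten_test_results(metadata: dict, results: dict) -> dict:
    """Build one CSV row: metadata as-is, metrics and data info prefixed."""
    row = dict(metadata)
    for name, value in results.get("test_metrics", {}).items():
        row[f"test_{name}"] = value
    for name, value in results.get("test_data_info", {}).items():
        row[f"test_data_{name}"] = value
    return row


def rows_to_csv(rows: list) -> str:
    """Serialize rows to CSV text, using the union of all keys as the header."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    # restval fills columns that are missing from a given row
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder values for illustration only
example = flatten_test_results(
    {"seed": 101, "method": "optuna"},
    {"test_metrics": {"accuracy": 0.5}, "test_data_info": {"n_samples": 10}},
)
print(rows_to_csv([example]))
```

Building the header from the union of keys keeps the CSV rectangular even when some runs report metrics that others lack.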
### Phase 4: Model Retraining
#### Optimal Model Selection and Retraining
```bash
# Retrain model on full dataset using optimal hyperparameters from CV
trace retrain model_xgboost_42.joblib gene_ranks_42.json full_data.h5
# Retrain with hyperparameter retuning on full dataset
trace retrain model_xgboost_42.joblib gene_ranks_42.json full_data.h5 \
--retune-hyperparameters
# Override class balancing for retraining
trace retrain model_xgboost_42.joblib gene_ranks_42.json full_data.h5 \
--balance-classes "yes:no_15:85"
# Use best hyperparameters from CV results instead of retuning
trace retrain model_xgboost_42.joblib gene_ranks_42.json full_data.h5 \
--use-best-hyperparameters
```
Important Notes for Retraining:
- Optimal Model Selection: Choose the best performing model from cross-validation results based on your primary metric (e.g., F1-score, PR-AUC)
- Hyperparameter Strategy: Use --retune-hyperparameters to optimize hyperparameters on the full dataset, --use-best-hyperparameters to use the best hyperparameters from CV results, or omit both to use the most stable hyperparameters from CV
- Class Balancing: Ensure consistent class balancing between training and retraining using the same --balance-classes specification
- Gene Selection: The retrained model uses the same gene selection from the original CV training for consistency
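
Selecting the optimal model by a primary metric amounts to taking a maximum over the cross-validation results. The sketch below assumes a hypothetical in-memory layout (a list of dicts with `model` and `metrics` keys) and uses placeholder scores; it does not reflect the format of TRACE's comparison report.

```python
def select_best_model(cv_results: list, primary_metric: str) -> dict:
    """Return the entry with the highest value of the primary metric.

    Entries that do not report the metric are ignored.
    """
    scored = [r for r in cv_results if primary_metric in r["metrics"]]
    if not scored:
        raise ValueError(f"no model reports metric {primary_metric!r}")
    return max(scored, key=lambda r: r["metrics"][primary_metric])


# Placeholder scores for illustration only
candidates = [
    {"model": "model_a.joblib", "metrics": {"f1": 0.6, "pr_auc": 0.7}},
    {"model": "model_b.joblib", "metrics": {"f1": 0.8, "pr_auc": 0.5}},
]
print(select_best_model(candidates, "f1")["model"])
```

The two placeholder candidates rank differently under `f1` and `pr_auc`, which is why the primary metric should be fixed before comparing models.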
Required Step: Before making predictions, new data must be preprocessed using preprocess-with-model to ensure proper feature alignment with the trained model. The predict command will enforce this requirement.
# Preprocess new .h5ad data using a trained model
trace preprocess-with-model model_xgboost_42.joblib new_data.h5ad \
--output-file new_data_aligned.h5
# Preprocess 10x Genomics HDF5 file (automatically detected)
trace preprocess-with-model model_xgboost_42.joblib \
filtered_feature_bc_matrix.h5 \
--output-file new_data_aligned.h5
# Preprocess from HTTPS URL
trace preprocess-with-model model_xgboost_42.joblib \
https://example.com/data.h5ad \
--output-file new_data_aligned.h5
# Preprocess already-processed .h5 data (re-aligns to model's training genes)
trace preprocess-with-model model_xgboost_42.joblib new_data_processed.h5 \
--output-file new_data_aligned.h5
# Use configuration file for clone ID/label column settings
trace --config config/config_preprocess_clone.yaml \
preprocess-with-model model_xgboost_42.joblib new_data.h5ad \
--output-file new_data_aligned.h5
What preprocess-with-model Does:
- Aligns input features to the model's training gene set (pads missing genes with zeros, ignores extra genes)
- Applies the same preprocessing transformations used during training (e.g., expression binning with the same number of bins)
- Ensures gene order matches the training data exactly
- Creates provenance metadata that predict uses to verify data compatibility
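
The alignment rule in the first and third bullets (zero-pad missing genes, drop extra genes, match the training gene order) can be sketched for a single cell as follows. This is a simplified illustration with hypothetical gene names and a plain dict per cell; the real command operates on full expression matrices.

```python
def align_to_training_genes(expression: dict, training_genes: list) -> list:
    """Return expression values in training-gene order.

    Genes absent from the input are padded with 0.0, and genes that the
    model was not trained on are ignored.
    """
    return [float(expression.get(gene, 0.0)) for gene in training_genes]


# Hypothetical gene names for illustration
training_genes = ["GENE_A", "GENE_B", "GENE_C"]
new_cell = {"GENE_B": 2.0, "GENE_X": 9.0, "GENE_A": 1.0}
print(align_to_training_genes(new_cell, training_genes))
```

Iterating over the training gene list, rather than over the input, guarantees that the output always has the length and order the model expects.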
Key Features:
- Accepts both .h5ad (raw) and .h5 (preprocessed) input files
- Supports 10x Genomics HDF5 files (.h5 with matrix group) - automatically detected and loaded
- Supports HTTPS URLs for remote data files (automatically downloaded)
- Always outputs .h5 format aligned to the model's training genes
- Automatically applies expression binning if the model was trained with binning
- Uses model metadata to ensure exact feature alignment
Important: Data must be preprocessed with preprocess-with-model before prediction. The predict command will verify this automatically.
# Step 1: Preprocess new data with the model
trace preprocess-with-model model_xgboost_42.joblib new_data.h5ad \
--output-file new_data_aligned.h5
# Step 2: Make predictions on the aligned data
trace predict model_xgboost_42.joblib new_data_aligned.h5
# Include confidence scores
trace predict model_xgboost_42.joblib new_data_aligned.h5 --confidence-scores
# Custom output file
trace predict model_xgboost_42.joblib new_data_aligned.h5 \
--output-predictions predictions.csv
TRACE supports processing large files from various sources using memory-efficient chunked processing:
# Predict on S3 file with chunked processing
trace predict model_xgboost_42.joblib s3://bucket/data.h5ad \
--preprocess-with-model \
--chunk-size 10000 --output-predictions predictions.csv
# Predict on HTTPS URL (10x Genomics example)
trace predict model_xgboost_42.joblib \
https://cf.10xgenomics.com/samples/cell-exp/4.0.0/SC3_v3_NextGem_SI_PBMC_10K/SC3_v3_NextGem_SI_PBMC_10K_filtered_feature_bc_matrix.h5 \
--preprocess-with-model \
--analysis-mode
# Local file with chunked processing
trace predict model_xgboost_42.joblib /path/to/large_data.h5ad \
--preprocess-with-model \
--chunk-size 10000 --output-predictions predictions.csv
10x Genomics Support:
- Automatically detects 10x Genomics HDF5 format by checking for matrix group
- Loads using scanpy's read_10x_h5() function
- Processes through the same preprocessing pipeline as .h5ad files
- Works with both local files and HTTPS URLs
Chunked Processing:
- Default chunk size: 10,000 cells
- Configurable via --chunk-size option
- Memory usage stays constant regardless of file size
- Automatically enabled for S3 paths, HTTPS URLs, or large local files
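
The reason memory usage stays flat is that only one chunk of cells is held at a time. A minimal sketch of the chunk boundaries, assuming cells are addressed by row index (the helper name `iter_chunks` is hypothetical):

```python
from typing import Iterator, Tuple


def iter_chunks(n_cells: int, chunk_size: int = 10_000) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) row ranges covering n_cells in fixed-size chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_cells, chunk_size):
        # The last chunk may be shorter than chunk_size
        yield start, min(start + chunk_size, n_cells)


for start, stop in iter_chunks(25_000):
    print(f"process rows {start}..{stop - 1}")
```

Each range would be loaded, predicted on, written out, and released before the next one is read, so peak memory depends on the chunk size rather than on the total number of cells.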
Supported Input Formats:
- .h5ad files (AnnData format) - local, S3, or HTTPS URLs
- .h5 files (TRACE preprocessed HDF5) - local, S3, or HTTPS URLs
- 10x Genomics HDF5 files (.h5 with matrix group) - automatically detected
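
Classifying an input by location and extension can be sketched with the standard library alone. The helper name `describe_input` is hypothetical. Note that the extension cannot distinguish a TRACE-preprocessed .h5 file from a 10x Genomics .h5 file, since that requires inspecting the file contents for the matrix group.

```python
from typing import Tuple
from urllib.parse import urlparse


def describe_input(path: str) -> Tuple[str, str]:
    """Return (location, file_format) for an input path or URL."""
    parsed = urlparse(path)
    scheme = parsed.scheme.lower()
    if scheme == "s3":
        location = "s3"
    elif scheme == "https":
        location = "https"
    else:
        location = "local"  # includes plain paths and Windows drive letters

    # For remote inputs, look only at the URL path (ignores query strings)
    name = parsed.path if location != "local" else path
    if name.endswith(".h5ad"):
        file_format = "h5ad"
    elif name.endswith(".h5"):
        file_format = "h5"  # TRACE-preprocessed or 10x; needs content inspection
    else:
        file_format = "unknown"
    return location, file_format


print(describe_input("s3://bucket/data.h5ad"))
```

Checking `.h5ad` before `.h5` is only a readability choice here, since a name ending in `.h5ad` does not end in `.h5`.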
Remote File Configuration:
- S3: Set AWS credentials via environment variables:
  - AWS_ACCESS_KEY_ID: Your AWS access key
  - AWS_SECRET_ACCESS_KEY: Your AWS secret key
  - AWS_SESSION_TOKEN: Optional session token (for temporary credentials)
  - AWS_REGION or AWS_DEFAULT_REGION: AWS region (default: us-east-1)
- HTTPS URLs: Automatically downloaded to temporary files (no configuration needed)
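
The region fallback described above can be sketched as a lookup over the environment. The precedence of AWS_REGION over AWS_DEFAULT_REGION is an assumption made for this sketch, and `resolve_aws_region` is a hypothetical helper name; only the two variable names and the us-east-1 default come from the list above.

```python
import os
from typing import Mapping, Optional


def resolve_aws_region(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the AWS region from the environment, defaulting to us-east-1.

    Assumption: AWS_REGION is checked before AWS_DEFAULT_REGION.
    Empty values are treated as unset.
    """
    env = os.environ if env is None else env
    return env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION") or "us-east-1"


print(resolve_aws_region())
```

Accepting the environment as a parameter makes the fallback order testable without modifying the real process environment.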
Apply predictions only to cells of interest, with NA output for non-target cells:
# Filter cells by obs column value
trace predict model_xgboost_42.joblib data.h5ad \
--preprocess-with-model \
--target-cells cell_type:T_cell \
--output-predictions predictions.csv
Train a model to classify cells as target vs non-target before prediction:
# Train cell classification model
trace classify-cells data.h5 cell_type T_cell \
--output-model cell_classifier.joblib \
--algorithm xgboost
# Use cell classifier in prediction
trace predict model_xgboost_42.joblib data.h5ad \
--preprocess-with-model \
--cell-classifier cell_classifier.joblib \
--output-predictions predictions.csv
# Set log level
trace --log-level DEBUG preprocess data.h5ad
# Use configuration file
trace --config my_config.yaml train processed_data.h5
# Specify output directory
trace --output-dir ./results preprocess data.h5ad
# 1. Create configuration
trace init-config --output-file experiment_config.yaml
# 2. Preprocess data
trace --config config/tracepy_default.yaml preprocess data/experiment.h5ad
# 3. Train model with hyperparameter tuning
trace --config config/tracepy_default.yaml train data/processed/experiment_processed.h5 \
--algorithm xgboost --hyperparameter-tuning --n-iter 50
# 2. Preprocess data (choose single-cell or clone-level)
# For single-cell analysis:
trace --config config/config_preprocess_sc.yaml preprocess data/experiment.h5ad
# For clone-level analysis:
trace --config config/config_preprocess_clone.yaml preprocess data/experiment.h5ad
# 3. Train multiple models with hyperparameter tuning using default config
trace --config config/config_train_evaluate.yaml train data/processed/experiment_processed.h5 \
--algorithm xgboost --hyperparameter-tuning --n-iter 50
trace --config config/config_train_evaluate.yaml train data/processed/experiment_processed.h5 \
--algorithm randomforest --hyperparameter-tuning --n-iter 30
trace --config config/config_train_evaluate.yaml train data/processed/experiment_processed.h5 \
--algorithm adaboost --hyperparameter-tuning --n-iter 30
# 4. Compare models to select optimal model
trace compare models/ --data-file data/processed/experiment_processed.h5
# 4b. Extract metrics from all training runs (optional, for analysis)
trace extract-metrics output/ --output-file training_metrics_consolidated.csv
# 5. Retrain optimal model on full dataset
trace retrain models/best_model.joblib models/best_model_gene_ranks.json \
data/processed/experiment_processed.h5 --retune-hyperparameters
# 6. Preprocess new data using the retrained model (recommended approach)
# This ensures proper feature alignment with the model's training genes
trace preprocess-with-model models/retrained_model.joblib data/new_experiment.h5ad \
--output-file data/processed/new_experiment_aligned.h5
# Alternative: If new data is already preprocessed, re-align it to the model
trace preprocess-with-model models/retrained_model.joblib data/processed/new_experiment_processed.h5 \
--output-file data/processed/new_experiment_aligned.h5
# 7. Make predictions with retrained model
trace predict models/retrained_model.joblib data/processed/new_experiment_aligned.h5 \
--confidence-scores
TRACE/
├── src/tracepy/          # Main package source code
│   ├── preprocessing/    # Phase 1: Data preprocessing
│   ├── models/           # Phase 2: Model training
│   ├── evaluation/       # Phase 3: Evaluation & metrics
│   ├── cli/              # CLI interface and commands
│   └── utils/            # Utility functions
├── config/               # Configuration files
└── README.md             # This file
This project is licensed under the MIT License - see the LICENSE file for details.