TRACE

A production-grade machine learning pipeline for single-cell RNA-seq data analysis with clone-level classification.

Installation

This project uses uv for fast Python package management.

Prerequisites

Python 3.11 or later
uv installed

Device Support: TRACE automatically detects and utilizes the best available device for XGBoost training:

CUDA/GPU: Automatically used when available (highest priority)
Apple Silicon: Detected automatically; training runs on the CPU for compatibility
CPU: Fallback for all systems
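
The priority order above can be sketched in a few lines of Python. This is an illustration only, not TRACE's actual detection code: the helper name pick_device and the use of an nvidia-smi binary on the PATH as the CUDA signal are assumptions made for this example.

```python
import platform
import shutil


def pick_device(has_cuda=None, system=None, machine=None):
    """Return (device, note) using the order CUDA > Apple Silicon > CPU.

    Hypothetical helper, not part of TRACE. The optional arguments are
    overrides so the logic can be exercised without matching hardware.
    """
    if has_cuda is None:
        # Assumption: an `nvidia-smi` binary on PATH signals a usable GPU.
        has_cuda = shutil.which("nvidia-smi") is not None
    system = system or platform.system()
    machine = machine or platform.machine()

    if has_cuda:
        return "cuda", "GPU training"
    if system == "Darwin" and machine == "arm64":
        # Apple Silicon still trains on the CPU, as stated in the list above.
        return "cpu", "Apple Silicon (CPU training)"
    return "cpu", "generic CPU fallback"


print(pick_device())
```

Passing explicit overrides, for example pick_device(has_cuda=False, system="Darwin", machine="arm64"), makes it easy to check each branch on any machine.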

Installation

# Install dependencies
uv sync

# Activate the virtual environment
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Verify installation
trace --help

Core Dependencies

TRACE requires the following production dependencies:

click>=8.0.0 - CLI framework
pyyaml>=6.0 - Configuration file parsing
pydantic>=2.0.0 - Data validation
pydantic-settings>=2.0.0 - Settings management
pandas>=2.0.0 - Data manipulation
numpy>=1.20.0 - Numerical computing
scikit-learn>=1.6.0,<2.0 - Machine learning utilities
xgboost>=1.7.0 - Gradient boosting framework
scanpy>=1.9.0 - Single-cell analysis
anndata>=0.9.0 - Annotated data structures
scipy>=1.10.0 - Scientific computing
matplotlib>=3.7.0 - Plotting
seaborn>=0.12.0 - Statistical visualization
scikit-misc>=0.2.0 - Miscellaneous scientific routines (e.g., LOESS smoothing)
psutil>=5.9.0 - System utilities
optuna>=3.5.0 - Hyperparameter optimization
boto3>=1.26.0 - AWS SDK
mlflow>=2.10.0 - ML experiment tracking
psycopg2-binary>=2.9.0 - PostgreSQL adapter
python-dotenv>=1.0.0 - Environment variable management
requests>=2.31.0 - HTTP library
shap>=0.43.0 - SHAP values for model interpretability

Essential Commands

TRACE provides a comprehensive CLI for all machine learning operations. Here are the essential commands organized by workflow phase:

Phase 1: Data Preprocessing

TRACE supports two distinct preprocessing workflows depending on your analysis needs:

Single-Cell Level Processing

For analyses that require individual cell-level data without clone summarization:

# Preprocess at single-cell level (no clone summarization)
trace preprocess data.h5ad --skip-clone-summarization

# Custom single-cell preprocessing
trace preprocess data.h5ad \
  --skip-clone-summarization \
  --gene-selection-method all \
  --n-top-genes 500 \
  --min-genes-per-cell 100 \
  --cell-id-column barcode \
  --label-column trt_label \
  --output-file single_cell_data.h5

Clone-Level Processing

For analyses that require clone-level expression summaries:

# Preprocess with clone summarization (default)
trace preprocess data.h5ad

# Custom clone-level preprocessing
trace preprocess data.h5ad \
  --gene-selection-method highly_variable \
  --n-top-genes 500 \
  --summary-method percentile \
  --percentile 75.0 \
  --min-cells-per-clone 1 \
  --clone-id-column clone_ID \
  --label-column trt_label \
  --output-file clone_data.h5
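
To make the percentile summary concrete, the sketch below collapses the cells of each clone into one expression vector by taking a per-gene percentile, mirroring --summary-method percentile, --percentile 75.0, and --min-cells-per-clone. It is a pure-Python illustration under stated assumptions: the function names are hypothetical, and the linear-interpolation percentile is an assumed convention that may differ from what TRACE uses internally.

```python
from collections import defaultdict


def percentile(values, q):
    """Percentile with linear interpolation between closest ranks (assumed convention)."""
    xs = sorted(values)
    if not xs:
        raise ValueError("percentile of an empty sequence")
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def summarize_clones(cells, q=75.0, min_cells_per_clone=1):
    """Collapse cells into one vector per clone.

    cells: iterable of (clone_id, [expression value per gene]).
    Clones with fewer than min_cells_per_clone cells are dropped.
    """
    grouped = defaultdict(list)
    for clone_id, expression in cells:
        grouped[clone_id].append(expression)

    summary = {}
    for clone_id, rows in grouped.items():
        if len(rows) < min_cells_per_clone:
            continue
        # zip(*rows) iterates gene by gene across the clone's cells.
        summary[clone_id] = [percentile(column, q) for column in zip(*rows)]
    return summary


# Toy input: two cells in clone A, one cell in clone B, two genes each.
toy_cells = [("A", [0, 4]), ("A", [4, 8]), ("B", [1, 1])]
print(summarize_clones(toy_cells))
```

Raising min_cells_per_clone to 2 in this toy example would drop clone B, since it contains a single cell.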

Configuration Files

TRACE provides several pre-configured files for different analysis types:

# Use pre-configured preprocessing files
trace --config config/config_preprocess_sc.yaml preprocess data.h5ad      # Single-cell
trace --config config/config_preprocess_clone.yaml preprocess data.h5ad   # Clone-level

# Use pre-configured training file
trace --config config/config_train_evaluate.yaml train processed_data.h5

Single-Cell Configuration (config/config_preprocess_sc.yaml):

preprocessing:
  gene_selection_method: all
  min_cells_per_clone: 1
  min_genes_per_cell: 100
  n_top_genes: 500
  percentile: 75.0
  quality_threshold: 1  # No filtering
  summary_method: percentile
  clone_id_column: clone_ID
  fallback_to_default_clone_ids: false
  label_column: trt_label
  fallback_to_default_labels: true
  skip_clone_summarization: true  # Key difference

Clone-Level Configuration (config/config_preprocess_clone.yaml):

preprocessing:
  gene_selection_method: all
  min_cells_per_clone: 1
  min_genes_per_cell: 100
  n_top_genes: 500
  percentile: 75.0
  quality_threshold: 1  # No filtering
  summary_method: percentile
  clone_id_column: clone_ID
  fallback_to_default_clone_ids: false
  label_column: trt_label
  fallback_to_default_labels: true
  skip_clone_summarization: false  # Key difference

Quality Assessment and Reporting

# Generate comprehensive quality assessment report
trace preprocess data.h5ad --quality-report

# Use configuration file with quality reporting
trace --config config/config_preprocess_sc.yaml preprocess data.h5ad --quality-report

Configuration Management

# Create default configuration template
trace init-config --output-file my_config.yaml

# Create batch processing configuration template
trace init-config --config-type batch --output-file batch_config.yaml

# Validate configuration file
trace validate-config my_config.yaml

# Use configuration file
trace --config config/tracepy_default.yaml preprocess data.h5ad

Available Configuration Files:

config/config_preprocess_sc.yaml - Single-cell preprocessing configuration
config/config_preprocess_clone.yaml - Clone-level preprocessing configuration
config/config_train_evaluate.yaml - Default training and evaluation configuration

Phase 2: Model Training

Basic Training

# Train XGBoost model with default settings
trace train processed_data.h5

# Train specific algorithm
trace train processed_data.h5 --algorithm randomforest

# Train with custom hyperparameters
trace train processed_data.h5 \
  --algorithm xgboost \
  --n-estimators 200 \
  --max-depth 8 \
  --learning-rate 0.05 \
  --cv-folds 10

Hyperparameter Tuning

# Enable hyperparameter tuning with random search
trace train processed_data.h5 \
  --hyperparameter-tuning \
  --n-iter 50

# Use Optuna Bayesian optimization
trace train processed_data.h5 \
  --hyperparameter-tuning \
  --tuning-method optuna \
  --optuna-trials 100 \
  --optuna-sampler tpe

# Grid search
trace train processed_data.h5 \
  --hyperparameter-tuning \
  --tuning-method grid_search
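
The difference between random search (bounded by --n-iter) and grid search (every combination) can be illustrated with the standard library alone. The search space below is hypothetical, since this README does not list the ranges TRACE actually explores, and the function names are made up for the example.

```python
import itertools
import random

# Hypothetical search space; TRACE's real ranges are not documented here.
SPACE = {
    "n_estimators": [100, 200, 400],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.01, 0.05, 0.1],
}


def grid_candidates(space):
    """Yield every combination in the space (grid search)."""
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def random_candidates(space, n_iter, seed=42):
    """Yield n_iter independently sampled combinations (random search)."""
    rng = random.Random(seed)
    for _ in range(n_iter):
        yield {name: rng.choice(choices) for name, choices in space.items()}


grid = list(grid_candidates(SPACE))
sampled = list(random_candidates(SPACE, n_iter=5))
print(len(grid), "grid candidates;", len(sampled), "random candidates")
```

Grid search cost grows multiplicatively with each added parameter, while random search cost is fixed by the iteration count, which is why a budgeted option such as --n-iter is useful for larger spaces.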

Phase 3: Model Evaluation

Basic Evaluation

# Evaluate trained model
trace evaluate model_xgboost_42.joblib processed_data.h5

# Generate evaluation plots
trace evaluate model_xgboost_42.joblib processed_data.h5 --generate-plots

# Custom output report
trace evaluate model_xgboost_42.joblib processed_data.h5 \
  --output-report evaluation_report.json

Model Comparison

# Compare models in directory
trace compare ./models/

# Compare with specific data
trace compare ./models/ --data-file processed_data.h5

# Use specific primary metric
trace compare ./models/ --primary-metric accuracy

# Custom comparison report
trace compare ./models/ --output-comparison comparison_report.json
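
Conceptually, choosing a primary metric means ranking the candidate models by that one score. The sketch below shows the idea; the function name is hypothetical, the scores are placeholders rather than real results, and it assumes that higher is better for the chosen metric.

```python
def rank_models(metrics_by_model, primary_metric):
    """Return model names sorted best-first by primary_metric (higher is better)."""
    missing = [name for name, metrics in metrics_by_model.items()
               if primary_metric not in metrics]
    if missing:
        raise KeyError(f"{primary_metric!r} missing for: {', '.join(missing)}")
    return sorted(
        metrics_by_model,
        key=lambda name: metrics_by_model[name][primary_metric],
        reverse=True,
    )


# Placeholder scores for illustration only, not real TRACE output.
scores = {
    "model_xgboost_42": {"accuracy": 0.90, "f1": 0.80},
    "model_randomforest_42": {"accuracy": 0.88, "f1": 0.84},
}
print(rank_models(scores, "accuracy"))
print(rank_models(scores, "f1"))
```

The two calls return different winners, which is why the primary metric should be fixed before comparing models.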

Extract Metrics from Output Directories

# Extract metrics from parameter sweep directory (nested structure)
trace extract-metrics output_121925

# Extract metrics from single training run directory
trace extract-metrics output/eb10/training/seed_101_optuna_percentile_25_topngenes_50

# Specify custom output CSV file
trace extract-metrics output_121925 --output-file metrics_consolidated.csv

What extract-metrics Does:

Automatically detects directory structure (parameter sweep vs single training run)
Scans for test_results_*.json files in training directories
Extracts all model performance metrics (accuracy, precision, recall, F1, ROC-AUC, PR-AUC, etc.)
Parses metadata from folder names (seed, method, percentile, topngenes, run_id)
Generates a consolidated CSV file with all metrics and metadata
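
As an illustration of the folder-name parsing step, the sketch below extracts seed, method, percentile, and topngenes from a training folder name such as the one used in the single-run example above. The regular expression and function name are assumptions written for this README, not TRACE's own parser, and run_id is left out because its source is not specified here.

```python
import re

# Assumed pattern: seed_{seed}_{method}_percentile_{percentile}_topngenes_{topngenes}
FOLDER_PATTERN = re.compile(
    r"^seed_(?P<seed>\d+)_(?P<method>.+?)"
    r"_percentile_(?P<percentile>\d+(?:\.\d+)?)"
    r"_topngenes_(?P<topngenes>\d+)$"
)


def parse_run_folder(name):
    """Return parsed metadata, or None if the name does not match the pattern."""
    match = FOLDER_PATTERN.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "seed": int(fields["seed"]),
        "method": fields["method"],  # may itself contain underscores
        "percentile": float(fields["percentile"]),
        "topngenes": int(fields["topngenes"]),
    }


print(parse_run_folder("seed_101_optuna_percentile_25_topngenes_50"))
```

The method group is matched lazily so that names containing underscores (for example grid_search) are still captured whole.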

Supported Directory Structures:

Parameter Sweep Structure:

output_dir/
├── run_type1/ (e.g., eb10bal, eb20bal, lognorm)
│   └── training/
│       └── seed_{seed}_{method}_percentile_{percentile}_topngenes_{topngenes}/
│           └── test_results_*.json
├── run_type2/
│   └── training/
│       └── ...

Single Training Run Structure:

output_dir/
├── test_results_*.json
OR
output_dir/
├── training/
│   └── test_results_*.json
OR
output_dir/
└── seed_{seed}_{method}_percentile_{percentile}_topngenes_{topngenes}/
    └── test_results_*.json

Output CSV Columns:

Metadata: model_id, run_id, seed, method, percentile, topngenes, algorithm, optimal_threshold, n_selected_genes
Test metrics: All metrics from test_metrics (prefixed with test_)
Test data info: All fields from test_data_info (prefixed with test_data_)
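
The column naming above amounts to flattening each results file into one row. The following sketch shows that flattening with the standard library; the function names are hypothetical, and the sample record uses placeholder values and an assumed JSON layout (top-level metadata plus test_metrics and test_data_info objects).

```python
import csv
import io


def flatten_result(result):
    """Flatten one results record into a single CSV row."""
    # Top-level scalar fields become metadata columns as-is.
    row = {key: value for key, value in result.items() if not isinstance(value, dict)}
    for key, value in result.get("test_metrics", {}).items():
        row[f"test_{key}"] = value
    for key, value in result.get("test_data_info", {}).items():
        row[f"test_data_{key}"] = value
    return row


def rows_to_csv(rows):
    """Serialize rows to CSV text, using the union of keys as the header."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder record for illustration only, not real TRACE output.
example = {
    "model_id": "xgboost_42",
    "algorithm": "xgboost",
    "test_metrics": {"accuracy": 0.9},
    "test_data_info": {"n_samples": 120},
}
print(rows_to_csv([flatten_result(example)]))
```

Using the union of keys for the header keeps runs with differing metric sets in one file, with empty cells where a metric is absent.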


Phase 4: Model Retraining

Optimal Model Selection and Retraining

# Retrain model on full dataset using optimal hyperparameters from CV
trace retrain model_xgboost_42.joblib gene_ranks_42.json full_data.h5

# Retrain with hyperparameter retuning on full dataset
trace retrain model_xgboost_42.joblib gene_ranks_42.json full_data.h5 \
  --retune-hyperparameters

# Override class balancing for retraining
trace retrain model_xgboost_42.joblib gene_ranks_42.json full_data.h5 \
  --balance-classes "yes:no_15:85"

# Use best hyperparameters from CV results instead of retuning
trace retrain model_xgboost_42.joblib gene_ranks_42.json full_data.h5 \
  --use-best-hyperparameters

Important Notes for Retraining:

Optimal Model Selection: Choose the best performing model from cross-validation results based on your primary metric (e.g., F1-score, PR-AUC)
Hyperparameter Strategy: Pass --retune-hyperparameters to re-optimize hyperparameters on the full dataset, or --use-best-hyperparameters to reuse the best-scoring hyperparameters from CV. With neither flag, retraining uses the most stable hyperparameters from CV
Class Balancing: Ensure consistent class balancing between training and retraining using the same --balance-classes specification
Gene Selection: The retrained model uses the same gene selection from the original CV training for consistency

Phase 5: Production Usage

Preprocessing New Data with a Trained Model

Required Step: Before making predictions, new data must be preprocessed using preprocess-with-model to ensure proper feature alignment with the trained model. The predict command will enforce this requirement.

# Preprocess new .h5ad data using a trained model
trace preprocess-with-model model_xgboost_42.joblib new_data.h5ad \
  --output-file new_data_aligned.h5

# Preprocess 10x Genomics HDF5 file (automatically detected)
trace preprocess-with-model model_xgboost_42.joblib \
  filtered_feature_bc_matrix.h5 \
  --output-file new_data_aligned.h5

# Preprocess from HTTPS URL
trace preprocess-with-model model_xgboost_42.joblib \
  https://example.com/data.h5ad \
  --output-file new_data_aligned.h5

# Preprocess already-processed .h5 data (re-aligns to model's training genes)
trace preprocess-with-model model_xgboost_42.joblib new_data_processed.h5 \
  --output-file new_data_aligned.h5

# Use configuration file for clone ID/label column settings
trace --config config/config_preprocess_clone.yaml \
  preprocess-with-model model_xgboost_42.joblib new_data.h5ad \
  --output-file new_data_aligned.h5

What preprocess-with-model Does:

Aligns input features to the model's training gene set (pads missing genes with zeros, ignores extra genes)
Applies the same preprocessing transformations used during training (e.g., expression binning with the same number of bins)
Ensures gene order matches the training data exactly
Creates provenance metadata that predict uses to verify data compatibility
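
The alignment step described in the first and third points can be sketched as follows: missing genes are padded with zeros, extra genes are ignored, and columns follow the training order. The function name and list-based data layout are assumptions for illustration; binning and provenance metadata are not covered.

```python
def align_matrix(rows, input_genes, training_genes):
    """Reorder and pad expression rows to match the training gene set.

    rows: list of per-cell (or per-clone) expression lists, ordered as input_genes.
    Returns rows ordered as training_genes, with 0.0 for genes absent from the input.
    """
    column_of = {gene: index for index, gene in enumerate(input_genes)}
    aligned = []
    for row in rows:
        aligned.append([
            row[column_of[gene]] if gene in column_of else 0.0
            for gene in training_genes
        ])
    return aligned


# Input has an extra gene (X) and lacks gene C; order also differs.
rows = [[1.0, 2.0, 3.0]]
print(align_matrix(rows, input_genes=["B", "X", "A"], training_genes=["A", "B", "C"]))
```

Building the output by iterating over the training genes, rather than the input genes, guarantees the column order matches the model regardless of how the new data is laid out.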

Key Features:

Accepts both .h5ad (raw) and .h5 (preprocessed) input files
Supports 10x Genomics HDF5 files (.h5 with matrix group) - automatically detected and loaded
Supports HTTPS URLs for remote data files (automatically downloaded)
Always outputs .h5 format aligned to the model's training genes
Automatically applies expression binning if the model was trained with binning
Uses model metadata to ensure exact feature alignment

Making Predictions on New Data

Important: Data must be preprocessed with preprocess-with-model before prediction. The predict command will verify this automatically.

# Step 1: Preprocess new data with the model
trace preprocess-with-model model_xgboost_42.joblib new_data.h5ad \
  --output-file new_data_aligned.h5

# Step 2: Make predictions on the aligned data
trace predict model_xgboost_42.joblib new_data_aligned.h5

# Include confidence scores
trace predict model_xgboost_42.joblib new_data_aligned.h5 --confidence-scores

# Custom output file
trace predict model_xgboost_42.joblib new_data_aligned.h5 \
  --output-predictions predictions.csv

Remote File Support and Chunked Processing

TRACE supports processing large files from various sources using memory-efficient chunked processing:

# Predict on S3 file with chunked processing
trace predict model_xgboost_42.joblib s3://bucket/data.h5ad \
  --preprocess-with-model \
  --chunk-size 10000 --output-predictions predictions.csv

# Predict on HTTPS URL (10x Genomics example)
trace predict model_xgboost_42.joblib \
  https://cf.10xgenomics.com/samples/cell-exp/4.0.0/SC3_v3_NextGem_SI_PBMC_10K/SC3_v3_NextGem_SI_PBMC_10K_filtered_feature_bc_matrix.h5 \
  --preprocess-with-model \
  --analysis-mode

# Local file with chunked processing
trace predict model_xgboost_42.joblib /path/to/large_data.h5ad \
  --preprocess-with-model \
  --chunk-size 10000 --output-predictions predictions.csv

10x Genomics Support:

Automatically detects 10x Genomics HDF5 format by checking for matrix group
Loads using scanpy's read_10x_h5() function
Processes through the same preprocessing pipeline as .h5ad files
Works with both local files and HTTPS URLs

Chunked Processing:

Default chunk size: 10,000 cells
Configurable via --chunk-size option
Memory usage is bounded by the chunk size rather than the file size
Automatically enabled for S3 paths, HTTPS URLs, or large local files
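
The idea behind chunked processing is to walk the data in fixed-size slices so that only one slice is in memory at a time. The sketch below uses an in-memory list to keep the example runnable; the function names are hypothetical, and a real pipeline would read each slice from disk or a remote source instead.

```python
def iter_chunks(n_cells, chunk_size=10_000):
    """Yield (start, stop) index pairs covering n_cells in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_cells, chunk_size):
        yield start, min(start + chunk_size, n_cells)


def predict_in_chunks(cells, predict_fn, chunk_size=10_000):
    """Apply predict_fn to one chunk at a time and stream the results."""
    for start, stop in iter_chunks(len(cells), chunk_size):
        # Only this slice needs to be materialized at any moment.
        yield from predict_fn(cells[start:stop])


print(list(iter_chunks(25_000)))
```

With 25,000 cells and the default chunk size, the final chunk is shorter than the others, which the min() call handles.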

Supported Input Formats:

.h5ad files (AnnData format) - local, S3, or HTTPS URLs
.h5 files (TRACE preprocessed HDF5) - local, S3, or HTTPS URLs
10x Genomics HDF5 files (.h5 with matrix group) - automatically detected

Remote File Configuration:

S3: Set AWS credentials via environment variables:
- AWS_ACCESS_KEY_ID: Your AWS access key
- AWS_SECRET_ACCESS_KEY: Your AWS secret key
- AWS_SESSION_TOKEN: Optional session token (for temporary credentials)
- AWS_REGION or AWS_DEFAULT_REGION: AWS region (default: us-east-1)
HTTPS URLs: Automatically downloaded to temporary files (no configuration needed)
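
A quick pre-flight check of the S3 environment variables might look like the sketch below. The function name is hypothetical, and the precedence of AWS_REGION over AWS_DEFAULT_REGION is an assumption, since the text above only says that either may be set.

```python
import os


def resolve_s3_settings(env=None):
    """Validate required credentials and resolve the region with a default."""
    env = os.environ if env is None else env
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        # Assumed precedence: AWS_REGION, then AWS_DEFAULT_REGION, then us-east-1.
        "region": env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION") or "us-east-1",
        # The session token is optional (temporary credentials only).
        "uses_session_token": bool(env.get("AWS_SESSION_TOKEN")),
    }


# Dummy values for illustration; never hard-code real credentials.
demo_env = {"AWS_ACCESS_KEY_ID": "dummy", "AWS_SECRET_ACCESS_KEY": "dummy"}
print(resolve_s3_settings(demo_env))
```

Accepting the environment as a parameter keeps the check testable without touching the real process environment.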

Selective Cell Prediction

Apply predictions only to cells of interest, with NA output for non-target cells:

# Filter cells by obs column value
trace predict model_xgboost_42.joblib data.h5ad \
  --preprocess-with-model \
  --target-cells cell_type:T_cell \
  --output-predictions predictions.csv
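
The column:value form of --target-cells and the NA behavior for non-target cells can be sketched as below. This is an illustrative reading of the option, with hypothetical function names and a stand-in prediction function; TRACE's own parsing and output format may differ.

```python
def parse_target_cells(spec):
    """Split a 'column:value' spec into its two parts."""
    column, separator, value = spec.partition(":")
    if not separator or not column or not value:
        raise ValueError(f"expected 'column:value', got {spec!r}")
    return column, value


def predict_selected(obs_rows, spec, predict_fn):
    """Predict only for rows whose obs column equals the target value.

    Non-target rows receive the string "NA" so output stays row-aligned.
    """
    column, value = parse_target_cells(spec)
    return [
        predict_fn(row) if row.get(column) == value else "NA"
        for row in obs_rows
    ]


# Toy observations and a stand-in model that always answers "yes".
obs = [
    {"barcode": "c1", "cell_type": "T_cell"},
    {"barcode": "c2", "cell_type": "B_cell"},
]
print(predict_selected(obs, "cell_type:T_cell", lambda row: "yes"))
```

Keeping one output entry per input cell means the predictions file can be joined back to the original data by position or barcode.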

Cell Classification

Train a model to classify cells as target vs non-target before prediction:

# Train cell classification model
trace classify-cells data.h5 cell_type T_cell \
  --output-model cell_classifier.joblib \
  --algorithm xgboost

# Use cell classifier in prediction
trace predict model_xgboost_42.joblib data.h5ad \
  --preprocess-with-model \
  --cell-classifier cell_classifier.joblib \
  --output-predictions predictions.csv

Global Options

# Set log level
trace --log-level DEBUG preprocess data.h5ad

# Use configuration file
trace --config my_config.yaml train processed_data.h5

# Specify output directory
trace --output-dir ./results preprocess data.h5ad

Complete Workflow Example

# 1. Create configuration
trace init-config --output-file experiment_config.yaml

# 2. Preprocess data (choose single-cell or clone-level)
# For single-cell analysis:
trace --config config/config_preprocess_sc.yaml preprocess data/experiment.h5ad

# For clone-level analysis:
trace --config config/config_preprocess_clone.yaml preprocess data/experiment.h5ad

# 3. Train multiple models with hyperparameter tuning using default config
trace --config config/config_train_evaluate.yaml train data/processed/experiment_processed.h5 \
  --algorithm xgboost --hyperparameter-tuning --n-iter 50
trace --config config/config_train_evaluate.yaml train data/processed/experiment_processed.h5 \
  --algorithm randomforest --hyperparameter-tuning --n-iter 30
trace --config config/config_train_evaluate.yaml train data/processed/experiment_processed.h5 \
  --algorithm adaboost --hyperparameter-tuning --n-iter 30

# 4. Compare models to select optimal model
trace compare models/ --data-file data/processed/experiment_processed.h5

# 4b. Extract metrics from all training runs (optional, for analysis)
trace extract-metrics output/ --output-file training_metrics_consolidated.csv

# 5. Retrain optimal model on full dataset
trace retrain models/best_model.joblib models/best_model_gene_ranks.json \
  data/processed/experiment_processed.h5 --retune-hyperparameters

# 6. Preprocess new data using the retrained model (recommended approach)
# This ensures proper feature alignment with the model's training genes
trace preprocess-with-model models/retrained_model.joblib data/new_experiment.h5ad \
  --output-file data/processed/new_experiment_aligned.h5

# Alternative: If new data is already preprocessed, re-align it to the model
trace preprocess-with-model models/retrained_model.joblib data/processed/new_experiment_processed.h5 \
  --output-file data/processed/new_experiment_aligned.h5

# 7. Make predictions with retrained model
trace predict models/retrained_model.joblib data/processed/new_experiment_aligned.h5 \
  --confidence-scores

Project Structure

TRACE/
├── src/tracepy/                     # Main package source code
│   ├── preprocessing/               # Phase 1: Data preprocessing
│   ├── models/                      # Phase 2: Model training
│   ├── evaluation/                  # Phase 3: Evaluation & metrics
│   ├── cli/                         # CLI interface and commands
│   └── utils/                       # Utility functions
├── config/                          # Configuration files
└── README.md                        # This file

License

This project is licensed under the MIT License - see the LICENSE file for details.
