TRACE

A production-grade machine learning pipeline for single-cell RNA-seq data analysis with clone-level classification.

Installation

This project uses uv for fast Python package management.

Prerequisites

  • Python 3.11 or later
  • uv installed

Device Support: TRACE automatically detects and utilizes the best available device for XGBoost training:

  • CUDA/GPU: Automatically used when available (highest priority)
  • Apple Silicon: Detected on Apple Silicon Macs; training runs on the CPU for compatibility
  • CPU: Fallback for all systems
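
The priority order above can be sketched as a small selection function. This is only an illustration under stated assumptions: `select_training_device` and `detect_training_device` are hypothetical names, and looking for `nvidia-smi` on the PATH is a stand-in heuristic, not necessarily how TRACE detects a GPU.

```python
import platform
import shutil


def select_training_device(has_cuda: bool, system: str, machine: str) -> str:
    """Return "cuda" or "cpu" following the priority list above (hypothetical helper)."""
    if has_cuda:
        return "cuda"  # highest priority when a GPU is available
    if system == "Darwin" and machine == "arm64":
        return "cpu"  # Apple Silicon: CPU training for compatibility
    return "cpu"  # fallback for all other systems


def detect_training_device() -> str:
    """Probe the current machine; the nvidia-smi lookup is an assumed heuristic."""
    has_cuda = shutil.which("nvidia-smi") is not None
    return select_training_device(has_cuda, platform.system(), platform.machine())


if __name__ == "__main__":
    print(detect_training_device())
```

Keeping the decision in a pure function makes the priority order easy to test without access to GPU hardware.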

Install and Verify

# Install dependencies
uv sync

# Activate the virtual environment
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Verify installation
trace --help

Core Dependencies

TRACE requires the following production dependencies:

  • click>=8.0.0 - CLI framework
  • pyyaml>=6.0 - Configuration file parsing
  • pydantic>=2.0.0 - Data validation
  • pydantic-settings>=2.0.0 - Settings management
  • pandas>=2.0.0 - Data manipulation
  • numpy>=1.20.0 - Numerical computing
  • scikit-learn>=1.6.0,<2.0 - Machine learning utilities
  • xgboost>=1.7.0 - Gradient boosting framework
  • scanpy>=1.9.0 - Single-cell analysis
  • anndata>=0.9.0 - Annotated data structures
  • scipy>=1.10.0 - Scientific computing
  • matplotlib>=3.7.0 - Plotting
  • seaborn>=0.12.0 - Statistical visualization
  • scikit-misc>=0.2.0 - Miscellaneous scientific routines (LOESS, required by scanpy's seurat_v3 highly variable gene selection)
  • psutil>=5.9.0 - System utilities
  • optuna>=3.5.0 - Hyperparameter optimization
  • boto3>=1.26.0 - AWS SDK
  • mlflow>=2.10.0 - ML experiment tracking
  • psycopg2-binary>=2.9.0 - PostgreSQL adapter
  • python-dotenv>=1.0.0 - Environment variable management
  • requests>=2.31.0 - HTTP library
  • shap>=0.43.0 - SHAP values for model interpretability

Essential Commands

TRACE exposes each stage of the workflow through its CLI. The essential commands below are organized by workflow phase:

Phase 1: Data Preprocessing

TRACE supports two distinct preprocessing workflows depending on your analysis needs:

Single-Cell Level Processing

For analyses that require individual cell-level data without clone summarization:

# Preprocess at single-cell level (no clone summarization)
trace preprocess data.h5ad --skip-clone-summarization

# Custom single-cell preprocessing
trace preprocess data.h5ad \
  --skip-clone-summarization \
  --gene-selection-method all \
  --n-top-genes 500 \
  --min-genes-per-cell 100 \
  --cell-id-column barcode \
  --label-column trt_label \
  --output-file single_cell_data.h5

Clone-Level Processing

For analyses that require clone-level expression summaries:

# Preprocess with clone summarization (default)
trace preprocess data.h5ad

# Custom clone-level preprocessing
trace preprocess data.h5ad \
  --gene-selection-method highly_variable \
  --n-top-genes 500 \
  --summary-method percentile \
  --percentile 75.0 \
  --min-cells-per-clone 1 \
  --clone-id-column clone_ID \
  --label-column trt_label \
  --output-file clone_data.h5
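
As a rough illustration of what a percentile-based clone summary computes, the sketch below groups cells by clone ID and takes the requested percentile of each gene across the cells of a clone. It assumes linear interpolation between ranks (NumPy's default behaviour); `percentile` and `summarize_clones` are hypothetical helpers, and TRACE's actual implementation may differ.

```python
from collections import defaultdict


def percentile(values, q):
    """Percentile with linear interpolation between ranks (q in [0, 100])."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sequence")
    rank = (len(ordered) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def summarize_clones(cells, q=75.0, min_cells_per_clone=1):
    """cells: iterable of (clone_id, [expression per gene]) pairs."""
    by_clone = defaultdict(list)
    for clone_id, expression in cells:
        by_clone[clone_id].append(expression)

    summary = {}
    for clone_id, rows in by_clone.items():
        if len(rows) < min_cells_per_clone:
            continue  # drop clones with too few cells
        # zip(*rows) iterates gene by gene across the clone's cells
        summary[clone_id] = [percentile(column, q) for column in zip(*rows)]
    return summary


cells = [
    ("A", [0.0, 4.0]),
    ("A", [2.0, 8.0]),
    ("A", [4.0, 0.0]),
    ("B", [1.0, 1.0]),
]
print(summarize_clones(cells, q=75.0))
```

Each clone ends up as a single row with one value per gene, which is the shape a clone-level classifier would train on.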

Configuration Files

TRACE provides several pre-configured files for different analysis types:

# Use pre-configured preprocessing files
trace --config config/config_preprocess_sc.yaml preprocess data.h5ad      # Single-cell
trace --config config/config_preprocess_clone.yaml preprocess data.h5ad   # Clone-level

# Use pre-configured training file
trace --config config/config_train_evaluate.yaml train processed_data.h5

Single-Cell Configuration (config/config_preprocess_sc.yaml):

preprocessing:
  gene_selection_method: all
  min_cells_per_clone: 1
  min_genes_per_cell: 100
  n_top_genes: 500
  percentile: 75.0
  quality_threshold: 1  # No filtering
  summary_method: percentile
  clone_id_column: clone_ID
  fallback_to_default_clone_ids: false
  label_column: trt_label
  fallback_to_default_labels: true
  skip_clone_summarization: true  # Key difference

Clone-Level Configuration (config/config_preprocess_clone.yaml):

preprocessing:
  gene_selection_method: all
  min_cells_per_clone: 1
  min_genes_per_cell: 100
  n_top_genes: 500
  percentile: 75.0
  quality_threshold: 1  # No filtering
  summary_method: percentile
  clone_id_column: clone_ID
  fallback_to_default_clone_ids: false
  label_column: trt_label
  fallback_to_default_labels: true
  skip_clone_summarization: false  # Key difference
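
The two preprocessing sections above differ in a single key. A quick way to confirm this is to diff them as dictionaries. The values below are transcribed from the listings above; `config_diff` is a hypothetical helper, not part of TRACE.

```python
def config_diff(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose value differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


single_cell = {
    "gene_selection_method": "all",
    "min_cells_per_clone": 1,
    "min_genes_per_cell": 100,
    "n_top_genes": 500,
    "percentile": 75.0,
    "quality_threshold": 1,
    "summary_method": "percentile",
    "clone_id_column": "clone_ID",
    "fallback_to_default_clone_ids": False,
    "label_column": "trt_label",
    "fallback_to_default_labels": True,
    "skip_clone_summarization": True,
}

clone_level = {
    "gene_selection_method": "all",
    "min_cells_per_clone": 1,
    "min_genes_per_cell": 100,
    "n_top_genes": 500,
    "percentile": 75.0,
    "quality_threshold": 1,
    "summary_method": "percentile",
    "clone_id_column": "clone_ID",
    "fallback_to_default_clone_ids": False,
    "label_column": "trt_label",
    "fallback_to_default_labels": True,
    "skip_clone_summarization": False,
}

print(config_diff(single_cell, clone_level))
```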

Quality Assessment and Reporting

# Generate comprehensive quality assessment report
trace preprocess data.h5ad --quality-report

# Use configuration file with quality reporting
trace --config config/config_preprocess_sc.yaml preprocess data.h5ad --quality-report

Configuration Management

# Create default configuration template
trace init-config --output-file my_config.yaml

# Create batch processing configuration template
trace init-config --config-type batch --output-file batch_config.yaml

# Validate configuration file
trace validate-config my_config.yaml

# Use configuration file
trace --config config/tracepy_default.yaml preprocess data.h5ad

Available Configuration Files:

  • config/config_preprocess_sc.yaml - Single-cell preprocessing configuration
  • config/config_preprocess_clone.yaml - Clone-level preprocessing configuration
  • config/config_train_evaluate.yaml - Default training and evaluation configuration

Phase 2: Model Training

Basic Training

# Train XGBoost model with default settings
trace train processed_data.h5

# Train specific algorithm
trace train processed_data.h5 --algorithm randomforest

# Train with custom hyperparameters
trace train processed_data.h5 \
  --algorithm xgboost \
  --n-estimators 200 \
  --max-depth 8 \
  --learning-rate 0.05 \
  --cv-folds 10

Hyperparameter Tuning

# Enable hyperparameter tuning with random search
trace train processed_data.h5 \
  --hyperparameter-tuning \
  --n-iter 50

# Use Optuna Bayesian optimization
trace train processed_data.h5 \
  --hyperparameter-tuning \
  --tuning-method optuna \
  --optuna-trials 100 \
  --optuna-sampler tpe

# Grid search
trace train processed_data.h5 \
  --hyperparameter-tuning \
  --tuning-method grid_search
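
For readers unfamiliar with random search, the sketch below shows the general technique: draw a fixed number of configurations from a search space and keep the best-scoring one. It is a generic illustration, not TRACE's tuning code; `random_search`, the search space, and the toy objective (a stand-in for a cross-validated score) are all assumptions made for the example.

```python
import random


def random_search(objective, space, n_iter=50, seed=42):
    """Sample n_iter configurations from `space` and return the best one."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    """Stand-in for a cross-validated score; higher is better."""
    return -abs(params["max_depth"] - 8) - abs(params["learning_rate"] - 0.05) * 10


space = {
    "n_estimators": [100, 200, 400],
    "max_depth": [4, 6, 8, 10],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
}

best_params, best_score = random_search(toy_objective, space, n_iter=50)
print(best_params, best_score)
```

Grid search would instead enumerate every combination in the space, while a Bayesian optimizer uses earlier trial results to decide which configuration to try next.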

Phase 3: Model Evaluation

Basic Evaluation

# Evaluate trained model
trace evaluate model_xgboost_42.joblib processed_data.h5

# Generate evaluation plots
trace evaluate model_xgboost_42.joblib processed_data.h5 --generate-plots

# Custom output report
trace evaluate model_xgboost_42.joblib processed_data.h5 \
  --output-report evaluation_report.json

Model Comparison

# Compare models in directory
trace compare ./models/

# Compare with specific data
trace compare ./models/ --data-file processed_data.h5

# Use specific primary metric
trace compare ./models/ --primary-metric accuracy

# Custom comparison report
trace compare ./models/ --output-comparison comparison_report.json

Extract Metrics from Output Directories

# Extract metrics from parameter sweep directory (nested structure)
trace extract-metrics output_121925

# Extract metrics from single training run directory
trace extract-metrics output/eb10/training/seed_101_optuna_percentile_25_topngenes_50

# Specify custom output CSV file
trace extract-metrics output_121925 --output-file metrics_consolidated.csv

What extract-metrics Does:

  • Automatically detects directory structure (parameter sweep vs single training run)
  • Scans for test_results_*.json files in training directories
  • Extracts all model performance metrics (accuracy, precision, recall, F1, ROC-AUC, PR-AUC, etc.)
  • Parses metadata from folder names (seed, method, percentile, topngenes, run_id)
  • Generates a consolidated CSV file with all metrics and metadata
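
The folder-name parsing step can be sketched with a regular expression over the `seed_{seed}_{method}_percentile_{percentile}_topngenes_{topngenes}` naming scheme shown in this section. `parse_run_dir` is a hypothetical helper written for illustration; the actual parser in TRACE may handle more cases.

```python
import re

# Matches names such as seed_101_optuna_percentile_25_topngenes_50
RUN_DIR_PATTERN = re.compile(
    r"^seed_(?P<seed>\d+)_(?P<method>.+?)"
    r"_percentile_(?P<percentile>\d+(?:\.\d+)?)"
    r"_topngenes_(?P<topngenes>\d+)$"
)


def parse_run_dir(name):
    """Return the metadata encoded in a training folder name, or None if it does not match."""
    match = RUN_DIR_PATTERN.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "seed": int(fields["seed"]),
        "method": fields["method"],
        "percentile": float(fields["percentile"]),
        "topngenes": int(fields["topngenes"]),
    }


print(parse_run_dir("seed_101_optuna_percentile_25_topngenes_50"))
```

The non-greedy `method` group lets method names that contain underscores (for example `grid_search`) parse correctly.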

Supported Directory Structures:

Parameter Sweep Structure:

output_dir/
├── run_type1/ (e.g., eb10bal, eb20bal, lognorm)
│   └── training/
│       └── seed_{seed}_{method}_percentile_{percentile}_topngenes_{topngenes}/
│           └── test_results_*.json
├── run_type2/
│   └── training/
│       └── ...

Single Training Run Structure:

output_dir/
├── test_results_*.json
OR
output_dir/
├── training/
│   └── test_results_*.json
OR
output_dir/
└── seed_{seed}_{method}_percentile_{percentile}_topngenes_{topngenes}/
    └── test_results_*.json

Output CSV Columns:

  • Metadata: model_id, run_id, seed, method, percentile, topngenes, algorithm, optimal_threshold, n_selected_genes
  • Test metrics: All metrics from test_metrics (prefixed with test_)
  • Test data info: All fields from test_data_info (prefixed with test_data_)
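
The column prefixes can be illustrated by flattening one results file into a CSV row. The sketch assumes each `test_results_*.json` file contains top-level `test_metrics` and `test_data_info` objects, as the column description implies. The helper names and the sample values are made up for the example.

```python
import csv
import io


def flatten_test_results(result):
    """Prefix metric keys with test_ and data-info keys with test_data_."""
    row = {}
    for key, value in result.get("test_metrics", {}).items():
        row[f"test_{key}"] = value
    for key, value in result.get("test_data_info", {}).items():
        row[f"test_data_{key}"] = value
    return row


def rows_to_csv(rows):
    """Write rows to CSV text, using the union of all keys as the header."""
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative values only, not real results.
sample = {
    "test_metrics": {"accuracy": 0.9, "f1": 0.8},
    "test_data_info": {"n_samples": 120},
}
row = flatten_test_results(sample)
print(rows_to_csv([row]))
```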

Phase 4: Model Retraining

Optimal Model Selection and Retraining

# Retrain model on full dataset using optimal hyperparameters from CV
trace retrain model_xgboost_42.joblib gene_ranks_42.json full_data.h5

# Retrain with hyperparameter retuning on full dataset
trace retrain model_xgboost_42.joblib gene_ranks_42.json full_data.h5 \
  --retune-hyperparameters

# Override class balancing for retraining
trace retrain model_xgboost_42.joblib gene_ranks_42.json full_data.h5 \
  --balance-classes "yes:no_15:85"

# Use best hyperparameters from CV results instead of retuning
trace retrain model_xgboost_42.joblib gene_ranks_42.json full_data.h5 \
  --use-best-hyperparameters

Important Notes for Retraining:

  • Optimal Model Selection: Choose the best performing model from cross-validation results based on your primary metric (e.g., F1-score, PR-AUC)
  • Hyperparameter Strategy: Use --retune-hyperparameters to re-optimize hyperparameters on the full dataset, or --use-best-hyperparameters to reuse the best hyperparameters from the CV results. With neither flag, retraining uses the most stable hyperparameters from CV
  • Class Balancing: Ensure consistent class balancing between training and retraining using the same --balance-classes specification
  • Gene Selection: The retrained model uses the same gene selection from the original CV training for consistency

Phase 5: Production Usage

Preprocessing New Data with a Trained Model

Required Step: Before making predictions, new data must be preprocessed using preprocess-with-model to ensure proper feature alignment with the trained model. The predict command will enforce this requirement.

# Preprocess new .h5ad data using a trained model
trace preprocess-with-model model_xgboost_42.joblib new_data.h5ad \
  --output-file new_data_aligned.h5

# Preprocess 10x Genomics HDF5 file (automatically detected)
trace preprocess-with-model model_xgboost_42.joblib \
  filtered_feature_bc_matrix.h5 \
  --output-file new_data_aligned.h5

# Preprocess from HTTPS URL
trace preprocess-with-model model_xgboost_42.joblib \
  https://example.com/data.h5ad \
  --output-file new_data_aligned.h5

# Preprocess already-processed .h5 data (re-aligns to model's training genes)
trace preprocess-with-model model_xgboost_42.joblib new_data_processed.h5 \
  --output-file new_data_aligned.h5

# Use configuration file for clone ID/label column settings
trace --config config/config_preprocess_clone.yaml \
  preprocess-with-model model_xgboost_42.joblib new_data.h5ad \
  --output-file new_data_aligned.h5

What preprocess-with-model Does:

  • Aligns input features to the model's training gene set (pads missing genes with zeros, ignores extra genes)
  • Applies the same preprocessing transformations used during training (e.g., expression binning with the same number of bins)
  • Ensures gene order matches the training data exactly
  • Creates provenance metadata that predict uses to verify data compatibility
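
The alignment step described above (zero-padding missing genes, dropping extra genes, and matching the training gene order) can be sketched in a few lines. `align_to_training_genes` is a hypothetical helper operating on plain lists. TRACE works on HDF5/AnnData matrices, so this only illustrates the logic.

```python
def align_to_training_genes(rows, input_genes, training_genes):
    """Reorder each expression row to the training gene order.

    Genes missing from the input are filled with 0.0; genes that the model
    was not trained on are dropped.
    """
    column_of = {gene: i for i, gene in enumerate(input_genes)}
    aligned = []
    for row in rows:
        aligned.append(
            [row[column_of[g]] if g in column_of else 0.0 for g in training_genes]
        )
    return aligned


input_genes = ["CD8A", "GZMB", "ACTB"]      # genes present in the new data
training_genes = ["GZMB", "PDCD1", "CD8A"]  # genes the model expects, in order
rows = [[1.0, 2.0, 3.0]]

print(align_to_training_genes(rows, input_genes, training_genes))
```

Here `PDCD1` is absent from the input and becomes a zero column, while `ACTB` is ignored because the model never saw it.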

Key Features:

  • Accepts both .h5ad (raw) and .h5 (preprocessed) input files
  • Supports 10x Genomics HDF5 files (.h5 with matrix group) - automatically detected and loaded
  • Supports HTTPS URLs for remote data files (automatically downloaded)
  • Always outputs .h5 format aligned to the model's training genes
  • Automatically applies expression binning if the model was trained with binning
  • Uses model metadata to ensure exact feature alignment

Making Predictions on New Data

Important: Data must be preprocessed with preprocess-with-model before prediction. The predict command will verify this automatically.

# Step 1: Preprocess new data with the model
trace preprocess-with-model model_xgboost_42.joblib new_data.h5ad \
  --output-file new_data_aligned.h5

# Step 2: Make predictions on the aligned data
trace predict model_xgboost_42.joblib new_data_aligned.h5

# Include confidence scores
trace predict model_xgboost_42.joblib new_data_aligned.h5 --confidence-scores

# Custom output file
trace predict model_xgboost_42.joblib new_data_aligned.h5 \
  --output-predictions predictions.csv

Remote File Support and Chunked Processing

TRACE supports processing large files from various sources using memory-efficient chunked processing:

# Predict on S3 file with chunked processing
trace predict model_xgboost_42.joblib s3://bucket/data.h5ad \
  --preprocess-with-model \
  --chunk-size 10000 --output-predictions predictions.csv

# Predict on HTTPS URL (10x Genomics example)
trace predict model_xgboost_42.joblib \
  https://cf.10xgenomics.com/samples/cell-exp/4.0.0/SC3_v3_NextGem_SI_PBMC_10K/SC3_v3_NextGem_SI_PBMC_10K_filtered_feature_bc_matrix.h5 \
  --preprocess-with-model \
  --analysis-mode

# Local file with chunked processing
trace predict model_xgboost_42.joblib /path/to/large_data.h5ad \
  --preprocess-with-model \
  --chunk-size 10000 --output-predictions predictions.csv

10x Genomics Support:

  • Automatically detects 10x Genomics HDF5 format by checking for matrix group
  • Loads using scanpy's read_10x_h5() function
  • Processes through the same preprocessing pipeline as .h5ad files
  • Works with both local files and HTTPS URLs

Chunked Processing:

  • Default chunk size: 10,000 cells
  • Configurable via --chunk-size option
  • Peak memory usage is bounded by the chunk size rather than the file size
  • Automatically enabled for S3 paths, HTTPS URLs, or large local files
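
The idea behind chunked processing can be sketched as a generator that yields cell index ranges, so that only one slice of the expression matrix is held in memory at a time. `iter_chunks` and `predict_in_chunks` are hypothetical helpers, and the `load_chunk` and `predict_chunk` callables stand in for the file reader and the model.

```python
def iter_chunks(n_cells, chunk_size=10_000):
    """Yield (start, stop) index pairs covering range(n_cells)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_cells, chunk_size):
        yield start, min(start + chunk_size, n_cells)


def predict_in_chunks(load_chunk, predict_chunk, n_cells, chunk_size=10_000):
    """Run predict_chunk over the data one slice at a time."""
    predictions = []  # one small value per cell; the expression data is not kept
    for start, stop in iter_chunks(n_cells, chunk_size):
        block = load_chunk(start, stop)  # only this slice is in memory
        predictions.extend(predict_chunk(block))
    return predictions


print(list(iter_chunks(25_000)))
```

With the default chunk size, a file of 25,000 cells is read in three slices, the last one shorter than the others.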

Supported Input Formats:

  • .h5ad files (AnnData format) - local, S3, or HTTPS URLs
  • .h5 files (TRACE preprocessed HDF5) - local, S3, or HTTPS URLs
  • 10x Genomics HDF5 files (.h5 with matrix group) - automatically detected

Remote File Configuration:

  • S3: Set AWS credentials via environment variables:
    • AWS_ACCESS_KEY_ID: Your AWS access key
    • AWS_SECRET_ACCESS_KEY: Your AWS secret key
    • AWS_SESSION_TOKEN: Optional session token (for temporary credentials)
    • AWS_REGION or AWS_DEFAULT_REGION: AWS region (default: us-east-1)
  • HTTPS URLs: Automatically downloaded to temporary files (no configuration needed)
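
The region fallback and S3 path handling described above might look like the following sketch. It uses only the standard library; `resolve_aws_region` and `split_s3_uri` are hypothetical helpers, and giving `AWS_REGION` precedence over `AWS_DEFAULT_REGION` is an assumption.

```python
import os
from urllib.parse import urlparse


def resolve_aws_region(env=None):
    """Pick the region from the environment, defaulting to us-east-1."""
    env = os.environ if env is None else env
    # Assumed precedence: AWS_REGION first, then AWS_DEFAULT_REGION.
    return env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION") or "us-east-1"


def split_s3_uri(uri):
    """Split s3://bucket/key into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


print(resolve_aws_region({}))
print(split_s3_uri("s3://bucket/data.h5ad"))
```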

Selective Cell Prediction

Apply predictions only to cells of interest, with NA output for non-target cells:

# Filter cells by obs column value
trace predict model_xgboost_42.joblib data.h5ad \
  --preprocess-with-model \
  --target-cells cell_type:T_cell \
  --output-predictions predictions.csv
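
The behaviour of `--target-cells` can be pictured as a mask over the cell metadata: cells whose column matches the requested value are predicted, and all others receive NA. The sketch below assumes the `column:value` format shown in the command above; `parse_target_spec` and `predict_selected` are hypothetical helpers, and the `"yes"` label is a placeholder.

```python
def parse_target_spec(spec):
    """Split a "column:value" specification into its two parts."""
    column, sep, value = spec.partition(":")
    if not sep or not column or not value:
        raise ValueError(f"expected column:value, got {spec!r}")
    return column, value


def predict_selected(obs_rows, spec, predict_one):
    """Predict only for matching cells; emit "NA" for every other cell."""
    column, value = parse_target_spec(spec)
    return [
        predict_one(row) if row.get(column) == value else "NA"
        for row in obs_rows
    ]


obs_rows = [{"cell_type": "T_cell"}, {"cell_type": "B_cell"}]
print(predict_selected(obs_rows, "cell_type:T_cell", lambda row: "yes"))
```

Keeping one output entry per input cell preserves row alignment with the original data, which is why non-target cells are marked NA instead of being dropped.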

Cell Classification

Train a model to classify cells as target vs non-target before prediction:

# Train cell classification model
trace classify-cells data.h5 cell_type T_cell \
  --output-model cell_classifier.joblib \
  --algorithm xgboost

# Use cell classifier in prediction
trace predict model_xgboost_42.joblib data.h5ad \
  --preprocess-with-model \
  --cell-classifier cell_classifier.joblib \
  --output-predictions predictions.csv

Global Options

# Set log level
trace --log-level DEBUG preprocess data.h5ad

# Use configuration file
trace --config my_config.yaml train processed_data.h5

# Specify output directory
trace --output-dir ./results preprocess data.h5ad

Complete Workflow Example

# 1. Create configuration
trace init-config --output-file experiment_config.yaml

# 2. Preprocess data (choose one configuration)
# With the default configuration:
trace --config config/tracepy_default.yaml preprocess data/experiment.h5ad

# For single-cell analysis:
trace --config config/config_preprocess_sc.yaml preprocess data/experiment.h5ad

# For clone-level analysis:
trace --config config/config_preprocess_clone.yaml preprocess data/experiment.h5ad

# 3. Train models with hyperparameter tuning
# With the default configuration:
trace --config config/tracepy_default.yaml train data/processed/experiment_processed.h5 \
  --algorithm xgboost --hyperparameter-tuning --n-iter 50

# With the training configuration, one run per algorithm to compare in step 4:
trace --config config/config_train_evaluate.yaml train data/processed/experiment_processed.h5 \
  --algorithm xgboost --hyperparameter-tuning --n-iter 50
trace --config config/config_train_evaluate.yaml train data/processed/experiment_processed.h5 \
  --algorithm randomforest --hyperparameter-tuning --n-iter 30
trace --config config/config_train_evaluate.yaml train data/processed/experiment_processed.h5 \
  --algorithm adaboost --hyperparameter-tuning --n-iter 30

# 4. Compare models to select optimal model
trace compare models/ --data-file data/processed/experiment_processed.h5

# 4b. Extract metrics from all training runs (optional, for analysis)
trace extract-metrics output/ --output-file training_metrics_consolidated.csv

# 5. Retrain optimal model on full dataset
trace retrain models/best_model.joblib models/best_model_gene_ranks.json \
  data/processed/experiment_processed.h5 --retune-hyperparameters

# 6. Preprocess new data using the retrained model (recommended approach)
# This ensures proper feature alignment with the model's training genes
trace preprocess-with-model models/retrained_model.joblib data/new_experiment.h5ad \
  --output-file data/processed/new_experiment_aligned.h5

# Alternative: If new data is already preprocessed, re-align it to the model
trace preprocess-with-model models/retrained_model.joblib data/processed/new_experiment_processed.h5 \
  --output-file data/processed/new_experiment_aligned.h5

# 7. Make predictions with retrained model
trace predict models/retrained_model.joblib data/processed/new_experiment_aligned.h5 \
  --confidence-scores

Project Structure

TRACE/
├── src/tracepy/                     # Main package source code
│   ├── preprocessing/               # Phase 1: Data preprocessing
│   ├── models/                      # Phase 2: Model training
│   ├── evaluation/                  # Phase 3: Evaluation & metrics
│   ├── cli/                         # CLI interface and commands
│   └── utils/                       # Utility functions
├── config/                          # Configuration files
└── README.md                        # This file

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

Production version of TRACE - Tumor Reactivity Assessment using Clonal Expression
