Quant_Model_Testbench is a lightweight experimentation framework for systematically evaluating machine learning models across feature subsets and hyperparameter combinations.
Instead of manually trying different model configurations, the testbench automates experiment generation, execution, and logging. Results are stored incrementally so experiments can be analyzed later and promising configurations can be refined through deeper searches.
The repository currently demonstrates the framework using the Titanic survival prediction dataset, but the testbench itself is dataset-agnostic and can be applied to any structured dataset.
Machine learning experimentation often becomes disorganized:
- repeated manual testing
- inconsistent experiment tracking
- hyperparameter tuning done ad-hoc
- results scattered across notebooks
Quant_Model_Testbench addresses this by providing a simple system that:
- enumerates feature combinations
- tests hyperparameter grids
- logs structured experiment results
- supports iterative model refinement
The goal is to make model experimentation systematic, reproducible, and analyzable.
The testbench explores model performance along two primary axes.
Different combinations of dataset features are tested to determine which subsets contain the strongest predictive signal.
Example feature combinations:
[Pclass, Sex]
[Pclass, Sex, Fare]
[Sex, Age, Fare, Parch]
Each model is evaluated across different hyperparameter settings.
Example:
RandomForestClassifier
├── n_estimators = [10, 50, 100]
├── criterion = [gini, entropy]
Together these axes generate many experiment configurations, which the testbench evaluates automatically.
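The cross-product of feature subsets and hyperparameter grids can be enumerated with the standard library. This is a minimal sketch, not the repository's actual code; `feature_pool`, `hyper_grid`, and `enumerate_configs` are illustrative names mirroring the examples above:

```python
from itertools import combinations, product

# Illustrative feature pool and grid, matching the Titanic examples above.
feature_pool = ["Pclass", "Sex", "Age", "Fare", "Parch"]
hyper_grid = {"n_estimators": [10, 50, 100], "criterion": ["gini", "entropy"]}

def enumerate_configs(features, grid, min_size=2):
    """Yield (feature_subset, hyperparameters) pairs for every combination."""
    keys = list(grid)
    for size in range(min_size, len(features) + 1):
        for subset in combinations(features, size):
            for values in product(*(grid[k] for k in keys)):
                yield list(subset), dict(zip(keys, values))

# 26 subsets of size 2..5 from 5 features, times 6 grid points = 156 configs.
configs = list(enumerate_configs(feature_pool, hyper_grid))
```

Even this small pool yields 156 configurations, which is why automated execution and logging matter.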
The system separates dataset handling, feature exploration, model execution, and experiment logging.
┌────────────────────┐
│ Input Dataset │
│ (CSV / Pandas) │
└─────────┬──────────┘
│
▼
┌────────────────────┐
│ Feature Pool │
│ Feature Subsetting │
└─────────┬──────────┘
│
▼
┌────────────────────┐
│ Model Testbench │
│ Experiment Engine │
└─────────┬──────────┘
│
┌──────────────┴──────────────┐
▼ ▼
Hyperparameter Generator Model Execution
(grid / combos) (sklearn models)
│ │
└──────────────┬──────────────┘
▼
┌────────────────────┐
│ Metric Engine │
│ ACC, AUC, F1, MAE │
└─────────┬──────────┘
▼
┌────────────────────┐
│ Experiment Log │
│ CSV + JSONL store │
└─────────┬──────────┘
▼
┌────────────────────┐
│ Result Analysis │
│ Best model ranking │
└────────────────────┘
The framework supports two experimentation modes.
Quick mode performs a broad exploration of the search space.
Characteristics:
- tests many feature subsets
- uses limited hyperparameter combinations
- runs relatively fast
Purpose:
Identify promising feature sets and models.
Full mode performs deep hyperparameter searches.
Characteristics:
- selected feature subsets are locked
- full hyperparameter grids are explored
- focuses on optimizing promising models
Purpose:
Find the best configuration for the most promising models discovered during quick mode.
The current implementation supports the following scikit-learn models:
- DecisionTreeClassifier
- DecisionTreeRegressor
- RandomForestClassifier
- RandomForestRegressor
Both classification and regression approaches are supported.
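One way such a testbench can turn a logged model name and hyperparameter dict back into an estimator is a simple registry. The `MODEL_REGISTRY` mapping and `build_model` helper below are a sketch under assumed names, not the repository's implementation:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Illustrative registry: maps the model names used in logs to sklearn classes.
MODEL_REGISTRY = {
    "DecisionTreeClassifier": DecisionTreeClassifier,
    "DecisionTreeRegressor": DecisionTreeRegressor,
    "RandomForestClassifier": RandomForestClassifier,
    "RandomForestRegressor": RandomForestRegressor,
}

def build_model(name, hyper):
    """Instantiate a supported sklearn model from its name and hyperparameters."""
    return MODEL_REGISTRY[name](**hyper)

model = build_model("RandomForestClassifier",
                    {"n_estimators": 100, "criterion": "entropy"})
```

A registry like this keeps experiment records model-agnostic: adding a new estimator only requires one new dictionary entry.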
Experiments can be ranked using several metrics:
| Metric | Description |
|---|---|
| MAE | Mean Absolute Error |
| LL | Log Loss |
| ACC | Accuracy |
| AUC | ROC AUC Score |
| F1 | F1 Score |
The ranking metric can be selected interactively during result analysis.
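All five metrics are available in `sklearn.metrics`; a sketch of how one experiment could be scored (the `score` helper is illustrative, not the repository's API):

```python
from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score,
                             mean_absolute_error, log_loss)

def score(y_true, y_pred, y_prob):
    """Compute the testbench's ranking metrics for one classification run.

    y_pred holds hard class labels; y_prob holds the predicted probability
    of the positive class (needed for AUC and log loss).
    """
    return {
        "ACC": accuracy_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_prob),
        "F1": f1_score(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "LL": log_loss(y_true, y_prob),
    }

metrics = score([1, 0, 1, 1], [1, 0, 0, 1], [0.9, 0.2, 0.4, 0.8])
```

Note that AUC and log loss require probabilities rather than hard predictions, so a model runner needs to capture both `predict` and `predict_proba` outputs.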
All experiment results are written incrementally to both:
- CSV files for quick inspection
- JSONL files for structured experiment records
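Incremental JSONL writing amounts to appending one JSON object per line, so a crash mid-run loses at most the experiment in flight. A minimal sketch (the `log_experiment` name is illustrative):

```python
import json

def log_experiment(path, entry):
    """Append one experiment record as a single JSON line.

    Opening in append mode means earlier records are never rewritten,
    so results accumulate safely as the run progresses.
    """
    with open(path, "a") as fh:
        fh.write(json.dumps(entry) + "\n")
```

Reading the log back is the mirror image: parse each line independently with `json.loads`.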
Each experiment entry includes:
- model type
- feature subset
- hyperparameters
- evaluation metrics
Example log entry:
{
"model": "RandomForestClassifier",
"features": ["Pclass", "Sex", "Parch", "Fare"],
"hyper": {
"n_estimators": 100,
"criterion": "entropy"
},
"metrics": {
"ACC": 0.8659,
"AUC": 0.8612,
"F1": 0.8286,
"MAE": 0.1341
}
}

A quick experiment run produced 250 model configurations.

Top configurations ranked by AUC:
| Rank | Model | Key Features | AUC | Accuracy |
|---|---|---|---|---|
| 1 | RandomForestClassifier | Pclass, Sex, Fare, Parch | ~0.86 | ~0.86 |
| 2 | RandomForestClassifier | Passenger class + demographics | ~0.85 | ~0.86 |
| 3 | DecisionTreeClassifier | Sex, Fare, Pclass | ~0.84 | ~0.84 |
Observed strong predictive features:
- Sex
- Pclass
- Fare
- Parch
These features capture key demographic and socioeconomic signals associated with survival outcomes.
Experiment results are organized into timestamped directories.
output/
│
├── quick_feature_combos/
│ └── <timestamp>/
│ ├── q.csv
│ └── q.jsonl
│
├── quick_features_full_hypers_combos/
│ └── <timestamp>/
│ ├── qf.csv
│ └── qf.jsonl
│
└── full_features_full_hypers_combos/
└── <timestamp>/
├── ff.csv
└── ff.jsonl
This prevents overwriting previous experiment results and allows long-term experiment tracking.
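Creating a fresh timestamped directory per run is a one-liner with `pathlib`; a sketch under assumed names (`make_run_dir` is illustrative, and the timestamp format is an assumption):

```python
from datetime import datetime
from pathlib import Path

def make_run_dir(base="output", mode="quick_feature_combos"):
    """Create output/<mode>/<timestamp>/ so earlier runs are never overwritten."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir = Path(base) / mode / stamp
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
```

Because the timestamp is part of the path, repeated runs in the same mode land in sibling directories rather than clobbering each other.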
git clone https://github.com/yourusername/Quant_Model_Testbench
cd Quant_Model_Testbench
./setup.py
This script creates a virtual environment and installs required dependencies.
python main.py
The CLI will guide you through:
- starting a new experiment
- selecting quick or full test modes
- ranking models by evaluation metrics
- running deeper hyperparameter searches
$ python main.py
Proceed with a fresh Model Testbench instead of analyzing past results? (Y/n)
> Y
Of the two available test modes - "quick" and "full", would you like to proceed with "quick"? (Y/n)
> Y
After experiments complete:
Top Models according to AUC
------------------------------------------------
1 | RandomForestClassifier
Features: [Pclass, Sex, Parch, Fare]
2 | RandomForestClassifier
Features: [Pclass, Sex, Age, Fare]
3 | DecisionTreeClassifier
Features: [Sex, Fare, Pclass]
The user can then select a configuration for deeper testing.
The repository demonstrates the testbench using the Titanic dataset from the Kaggle competition:
Titanic – Machine Learning from Disaster
Expected dataset location:
kaggle_data/train.csv
However, the framework can ingest any structured CSV dataset with a defined prediction target.
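Dataset-agnostic ingestion reduces to splitting one named target column off a pandas frame. A sketch, assuming pandas and a `load_dataset` helper name that is illustrative:

```python
import pandas as pd

def load_dataset(csv_path, target):
    """Load a structured CSV and split it into features X and target y."""
    df = pd.read_csv(csv_path)
    X = df.drop(columns=[target])  # every non-target column joins the feature pool
    y = df[target]
    return X, y
```

For the bundled example this would be `load_dataset("kaggle_data/train.csv", "Survived")`, but any CSV with a defined target column works the same way.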
Quant_Model_Testbench
│
├── main.py
│
├── src/
│ └── test_utils.py
│
├── kaggle_data/
│ └── train.csv
│
├── output/
│
├── data_xplore.py
│
└── README.md
The intended experimentation cycle:
1. Load dataset
2. Run quick feature sweep
3. Identify top feature sets
4. Select promising model
5. Lock feature subset
6. Run full hyperparameter grid
7. Evaluate best configuration
8. Iterate or deploy
This workflow helps prevent:
- ad-hoc tuning
- lost experiment configurations
- unreproducible results
Quant_Model_Testbench focuses on:
- Every experiment is logged and recoverable.
- Feature and hyperparameter combinations are generated systematically.
- Broad exploration first, followed by focused optimization.
Possible extensions include:
- additional models (XGBoost, LightGBM, CatBoost)
- experiment parallelization
- cross-validation integration
- automated feature importance analysis
- visualization dashboards
- experiment comparison tools
MIT License