
feat: CSV format validation & explanatory/target variable selection (CSV書式確認 & 説明因子・目的変数の選択機能) + pipeline architecture fixes #137

Open
minimumtone wants to merge 24 commits into main from
devin/1772697677-csv-validation-column-roles

Conversation

Owner

@minimumtone minimumtone commented Mar 5, 2026

feat: CSV format validation & explanatory/target variable selection (CSV書式確認 & 説明因子・目的変数の選択機能) + pipeline architecture fixes

Summary

Adds CSV format validation and interactive column role assignment to support diverse CSV formats beyond HEA composition data. Also addresses 4 fundamental pipeline architecture gaps: Phase 0→Phase 3 column propagation, feature selection integration, tree model preprocessing cleanup, and hyperparameter search expansion.

New functions:

  • _analyze_csv_format(): Analyzes uploaded CSV structure (dtypes, missing values, constant columns) and renders an HTML report table
  • _handle_generic_csv(): Handles non-HEA CSVs by using user-selected columns directly as features (bypasses compute_features()). Supports both numeric and string columns — string columns are automatically one-hot encoded via pd.get_dummies.

New UI in Config & Run tab:

  • CSV書式確認 ("CSV format check") accordion: Auto-opens on upload, showing per-column analysis (type, non-null count, sample values, warnings)
  • 説明因子・目的変数の選択 ("select explanatory/target variables") accordion: Dropdown for the target variable, CheckboxGroup for explanatory variables, Radio for CSV mode (auto/HEA/Generic)
  • Changing the target column auto-excludes it from the feature checkbox list
  • Feature checkboxes show all columns (numeric + string); string columns are annotated with a [文字列] ("string") tag for easy identification

Modified:

  • _handle_csv_upload() now accepts selected_features and force_generic params; auto-detects HEA vs Generic mode
  • run_experiment() passes column role selections through to the pipeline
  • Old hardcoded run_csv_target Textbox replaced with dynamic Dropdown populated from CSV columns
  • _refresh_ood_data() and _export_ood_csv() now detect generic CSV mode and use features_df.columns directly instead of HEA-specific FeatureCatalog.columns()

Updates since last revision

12 code review improvements (commit 158d67d)

Applied 12 code improvements based on a structured code review, prioritized by impact level.

HIGH priority (Fixes 1–4): Reproducibility & Performance

  1. runner.py - Lazy workflow instantiation: _run_job() replaced upfront instantiation of all 6 workflows with lambda factories (_BUILTIN_FACTORIES). Workflows are now instantiated on demand, only when selected.
  2. runner.py - OOD multi-fold ensemble: Phase 4 OOD detection now uses per-sample score accumulation across all seeds' fold-0 instead of only the first seed. Scores are mapped back to global sample indices via score_sum[te_idx] += res.composite_scores, then averaged only where samples overlap. This prevents bias from a single fold/seed.
  3. runner.py - O(1) OOD error lookup: _collect_ood_errors() replaced the O(n) linear np.array_equal() scan with an O(1) hash dict lookup (test_indices.tobytes() as key).
  4. feature_selection.py - AIC/BIC n_features guard: Forward stepwise selection is skipped when n_features > 30 to avoid an O(n²) bottleneck on high-dimensional feature sets (e.g., FS_MAGPIE with 132 features).
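Fix 3's keyed lookup can be sketched as follows. This is an illustrative reconstruction, not the actual _collect_ood_errors() code: helper names and data shapes are assumptions. Fold results are stored in a dict keyed by the raw bytes of each test-index array, so matching a fold becomes a single hash lookup instead of an np.array_equal() scan over all folds.

```python
import numpy as np

def build_error_index(fold_results):
    """fold_results: iterable of (test_indices, errors) pairs (toy stand-in)."""
    # Key by the raw bytes of the index array: hashable, exact, O(1) to look up.
    return {idx.tobytes(): err for idx, err in fold_results}

def lookup_errors(index, test_indices):
    # O(1) hash lookup replaces an O(n) linear np.array_equal() scan.
    return index.get(test_indices.tobytes())

folds = [(np.array([0, 2, 5]), np.array([0.1, 0.3, 0.2])),
         (np.array([1, 3, 4]), np.array([0.2, 0.4, 0.1]))]
idx = build_error_index(folds)
print(lookup_errors(idx, np.array([1, 3, 4])))  # → [0.2 0.4 0.1]
```

Note the caveat inherent in this scheme: the key is dtype- and order-sensitive, so the lookup array must be produced the same way as the stored one.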

MEDIUM priority (Fixes 5–9): Design & Maintainability

  5. _utils.py (new), runner.py, workflows.py - Consolidated utility functions: Identical safe_array()/_safe_np() functions merged into a new _utils.py module. Single source of truth for pandas→numpy C-contiguous conversion.
  6. evaluation.py - Configurable ValidityScore weights: Hardcoded weights (0.30, 0.20, 0.30, −0.15, 0.20, −0.10) moved to a _DEFAULT_WEIGHTS dict. FeatureValidityEvaluator(weights={...}) now accepts custom weights.
  7. evaluation.py - Generalisation score formula fix: Changed from min(1.0, 0.5 + 0.5 * geo_mean) to min(1.0, geo_mean) to remove the artificial 0.5 baseline bias.
  8. runner.py - Dynamic DataFrame columns: RunRegistry.to_dataframe() now uses dataclasses.fields(RunResult) instead of a hardcoded column list, automatically excluding non-scalar fields (ndarray, dict).
  9. ood.py - OODDetector init robustness: _actual_k is now initialized in __init__() instead of only in fit(), preventing AttributeError if score() is called before fit().
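The configurable-weights pattern from Fix 6 can be sketched like this. The six weight values come from the PR, but the component names and class internals here are invented for illustration; the real evaluation.py API may differ. Defaults live in a module-level dict and a per-instance override is shallow-merged on top.

```python
# Default weights from the PR; component names are assumptions, not the real keys.
_DEFAULT_WEIGHTS = {
    "coverage": 0.30, "stability": 0.20, "accuracy": 0.30,
    "complexity": -0.15, "generalisation": 0.20, "mc_penalty": -0.10,
}

class FeatureValidityEvaluator:
    def __init__(self, weights=None):
        # Shallow-merge: caller overrides win, unspecified keys keep defaults.
        self.weights = {**_DEFAULT_WEIGHTS, **(weights or {})}

    def score(self, components):
        # Weighted sum over whichever component scores are supplied.
        return sum(self.weights.get(k, 0.0) * v for k, v in components.items())

ev = FeatureValidityEvaluator(weights={"mc_penalty": -0.20})
print(ev.weights["mc_penalty"], ev.weights["coverage"])  # → -0.2 0.3
```

The merge keeps backward compatibility: constructing with no argument reproduces the hardcoded behaviour exactly.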

LOW priority (Fixes 10–12): Code Organization & Documentation

  10. runner.py - Unified integration interface: Passing an object (e.g., mlflow_tracker=tracker) now automatically implies use_mlflow=True. Boolean flags are retained for backward compatibility but are now redundant.
  11. features.py, data/delta_h_binary.json (new) - Externalized binary enthalpy data: The 179-entry _DELTA_H_BINARY dict moved to a JSON file, lazy-loaded via _load_delta_h() on the first get_binary_enthalpy() call.
  12. docs/user_manual.md - Phase 0.5 documentation: Added Phase 0.5 (multicollinearity diagnostics) to the workflow table with an explanation of VIF calculation and column drops.
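Fix 11's lazy loading with tuple-key conversion (comma-separated JSON keys like "Co,Cr" back to ("Co", "Cr"), as described in Notes) might look roughly like the following. File path, caching mechanism, and symmetric-pair fallback are assumptions, not the actual features.py implementation.

```python
import functools
import json
from pathlib import Path

@functools.lru_cache(maxsize=1)
def _load_delta_h(path="data/delta_h_binary.json"):
    # Load once on first call; JSON keys are comma-separated element pairs
    # ("Co,Cr"), converted back to the tuple keys used in memory.
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    return {tuple(k.split(",")): v for k, v in raw.items()}

def get_binary_enthalpy(a, b, path="data/delta_h_binary.json"):
    table = _load_delta_h(path)
    # Binary enthalpies are symmetric; try both orderings (assumed behaviour).
    return table.get((a, b), table.get((b, a)))
```

Because the conversion is a plain split on ",", element symbols containing commas would break it; for the 179 chemical-element pairs in the data file this cannot occur.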

Bug fixes from Devin Review (commit 72abb15)

  1. NameError: wf_map → _BUILTIN_FACTORIES (runner.py:263) - Error message referenced the deleted variable wf_map after Fix #1 renamed it. Now correctly references _BUILTIN_FACTORIES.

  2. OOD ensemble averaging bug (runner.py:868-904) - Fix #2 initially averaged composite scores element-wise across folds, but different seeds produce different test sets (different sample indices). Element-wise np.mean(fold_composites, axis=0) was meaningless — position i in each array corresponded to a different sample.

    Fix: Replaced with per-sample accumulation using global arrays:

    score_sum = np.zeros(n_samples, dtype=np.float64)
    score_count = np.zeros(n_samples, dtype=np.int32)
    for fold_i, (tr_idx, te_idx) in enumerate(all_ood_folds):
        # ... run detector ...
        score_sum[te_idx] += res.composite_scores
        score_count[te_idx] += 1
    # Average only where samples scored by multiple folds
    avg_scores = score_sum[primary_te] / np.maximum(score_count[primary_te], 1)
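With toy numbers the accumulation behaves as expected. The snippet below mirrors the fix above but is self-contained; the two folds and the primary_te index set are made-up values, chosen so that the fold test sets overlap on sample 2, whose final score should be the mean of the two fold scores.

```python
import numpy as np

n_samples = 5
score_sum = np.zeros(n_samples, dtype=np.float64)
score_count = np.zeros(n_samples, dtype=np.int32)
# Two toy folds whose test sets overlap on sample index 2.
toy_folds = [(np.array([0, 2]), np.array([0.2, 0.6])),
             (np.array([2, 4]), np.array([0.8, 0.4]))]
for te_idx, composite_scores in toy_folds:
    score_sum[te_idx] += composite_scores
    score_count[te_idx] += 1
primary_te = np.array([0, 2, 4])
avg_scores = score_sum[primary_te] / np.maximum(score_count[primary_te], 1)
print(avg_scores)  # → [0.2 0.7 0.4]
```

Sample 2 correctly averages (0.6 + 0.8) / 2 = 0.7, while samples scored by only one fold pass through unchanged; element-wise averaging of the raw fold arrays would instead have mixed scores from unrelated samples.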

Pipeline architecture fixes (4 items)

User identified fundamental architectural gaps causing poor model performance. These are systematic fixes, not whack-a-mole patches.

Fix 1: Phase 0 dropped columns → Phase 3 training (runner.py)

Phase 0 multicollinearity diagnostics identified constant and perfectly-collinear columns (dropped_constant, dropped_perfect) but Phase 3 training still used the original FeatureCatalog.columns() — the drops were never reflected in the actual training arrays.

Fix: Added _effective_cols: Dict[str, List[str]] to ExperimentRunner. After Phase 0 completes, effective columns are computed by subtracting dropped columns. Phase 3 array preparation and Phase 4 OOD detection both now use _effective_cols instead of raw FeatureCatalog.columns(). Feature sets with missing columns are skipped with a warning instead of crashing.
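The effective-columns bookkeeping can be sketched as a simple set subtraction that preserves column order. Function and variable names here are illustrative, not the actual ExperimentRunner attributes.

```python
def compute_effective_cols(catalog_cols, dropped_constant, dropped_perfect):
    """Subtract Phase 0 drop lists from each feature set, keeping order."""
    dropped = set(dropped_constant) | set(dropped_perfect)
    return {fs: [c for c in cols if c not in dropped]
            for fs, cols in catalog_cols.items()}

# Toy catalog standing in for FeatureCatalog.columns() output.
catalog = {"FS_SIZE": ["r_avg", "delta_r", "natoms"],
           "FS_ALL": ["r_avg", "delta_r", "natoms", "dS_mix"]}
eff = compute_effective_cols(catalog, dropped_constant=["natoms"], dropped_perfect=[])
print(eff["FS_ALL"])  # → ['r_avg', 'delta_r', 'dS_mix']
```

Phase 3 array preparation then indexes the data with eff[fs] rather than the raw catalog list, so the Phase 0 drops actually reach the training arrays.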

Fix 2: Integrate run_feature_selection() into runner.py (runner.py)

run_feature_selection() was implemented in feature_selection.py but never called from the main pipeline. High-dimensional/collinear features were passed directly to learning without any reduction.

Fix: Added "Phase 0.5" after Phase 0 diagnostics. For each feature set with >3 features, Lasso-based feature selection runs and updates _effective_cols. In testing with oqmd_l12_compounds.csv, this reduced FS_ALL from 77→13 features. Failures are caught and logged (falls back to all cleaned features).
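A Lasso-based selection step like the Phase 0.5 pass could be sketched as below. This is a hedged stand-in, not the real run_feature_selection() API: it standardizes features, fits LassoCV, keeps columns with nonzero weight, skips reduction for ≤3 features, and falls back to all columns on failure, matching the behaviour described above.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lasso_select(X, y, cols):
    if len(cols) <= 3:          # too few features to bother reducing
        return cols
    try:
        # Standardize so the L1 penalty treats all features comparably.
        Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
        model = LassoCV(cv=3, random_state=0).fit(Xs, y)
        kept = [c for c, w in zip(cols, model.coef_) if abs(w) > 1e-8]
        return kept or cols     # never return an empty feature set
    except Exception:
        return cols             # fall back to all cleaned features
```

On synthetic data where only two of ten columns carry signal, a pass like this typically retains those two plus at most a few weakly-weighted noise columns.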

Fix 3: Remove StandardScaler+PCA from tree models (workflows.py)

Tree models (RF, XGB) are scale-invariant and don't need StandardScaler. PCA destroys physically meaningful feature axes and makes feature importance meaningless.

Fix: Removed ("scaler", StandardScaler()) and _make_pca_step() from WorkflowXGB.run() and WorkflowRF.run(). Modified WorkflowENS._make_member() to conditionally apply preprocessing: tree-based members (XGB, GBR) get model-only pipelines; linear members (Ridge) keep scaler+PCA. The _make_member() signature now accepts n_features parameter for PCA estimation.
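The conditional preprocessing can be illustrated with a toy factory. This mirrors the described behaviour rather than the exact _make_member() code (member names, the PCA sizing rule, and using GradientBoostingRegressor in place of XGB are assumptions): tree-based members get model-only pipelines, while linear members keep scaler + PCA.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def make_member(name, n_features):
    if name in ("XGB", "GBR"):
        # Tree-based: scale-invariant, and PCA would destroy physically
        # meaningful feature axes, so no preprocessing at all.
        return Pipeline([("model", GradientBoostingRegressor())])
    # Linear member: keep scaler + PCA; component count derived from
    # n_features (toy rule for illustration).
    n_comp = max(1, min(n_features - 1, 8))
    return Pipeline([("scaler", StandardScaler()),
                     ("pca", PCA(n_components=n_comp)),
                     ("model", Ridge())])
```

Passing n_features into the factory is what lets the linear branch size its PCA without refitting logic elsewhere, which is presumably why the signature change was needed.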

Fix 4: Expand hyperparameter search space (workflows.py)

XGB and RF grids were too shallow (3 params × 2-3 values each) for effective exploration.

Fix:

  • XGB: Expanded from 3→6 params. Added subsample: [0.8, 1.0], colsample_bytree: [0.8, 1.0], min_child_weight: [1, 5]. Increased n_estimators: [100, 300] and max_depth: [3, 6].
  • RF: Expanded from 3→5 params. Added min_samples_leaf: [1, 2], max_features: ["sqrt", 1.0]. Increased n_estimators: [200, 500] and adjusted max_depth: [None, 15].

IMPORTANT: XGB param grid now conditionally adds colsample_bytree and min_child_weight only when _XGB_AVAILABLE is True, preventing crashes when using the GradientBoostingRegressor fallback (which doesn't support these params).
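The conditional grid construction could look like the sketch below. The values for n_estimators, max_depth, subsample, colsample_bytree, and min_child_weight come from the PR; the learning_rate entry and the function shape are assumptions, and xgb_available stands in for _XGB_AVAILABLE.

```python
def build_xgb_grid(xgb_available):
    grid = {
        "n_estimators": [100, 300],
        "max_depth": [3, 6],
        "learning_rate": [0.05, 0.1],   # assumed third base param
        "subsample": [0.8, 1.0],        # supported by both XGB and GBR
    }
    if xgb_available:
        # GradientBoostingRegressor fallback doesn't support these, so they
        # are added only when real XGBoost is installed.
        grid["colsample_bytree"] = [0.8, 1.0]
        grid["min_child_weight"] = [1, 5]
    return grid
```

Building the dict conditionally (rather than filtering a full grid later) keeps the search space definition and the fallback compatibility rule in one place.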

Error resilience: wrap all GUI refresh callbacks in try/except

User reported that after a long experiment (~16000 sec), the GUI showed "エラー" (Error) repeatedly but the terminal had no traceback. Root cause: Gradio internally catches exceptions in callbacks and displays a generic "Error" toast without logging to stderr.

Fix: All 6 post-experiment refresh/callback functions are now wrapped in try/except with logger.exception():

  • _refresh_dashboard_data: ("0", "--", "--", "--", ..., None, None)
  • _refresh_results_data: gr.update() × 3 + empty DataFrames + None plots
  • _refresh_ood_data: None plot + error message + empty DataFrame
  • _refresh_report_data: Error message string tuple
  • _refresh_data_summary: Empty summary tuple
  • _export_ood_csv: gr.update(value=None, visible=False)

This ensures:

  1. Any future exceptions are logged to terminal via logger.exception() instead of being silently swallowed
  2. GUI displays graceful fallback values instead of crashing with repeated "エラー" toasts
  3. The experiment results are not lost — users can still see partial data
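The pattern above can be expressed generically as a decorator (the actual PR wraps each function in its own try/except; a decorator gives equivalent behaviour and is shown here only as a sketch): log the full traceback via logger.exception() and return a safe fallback so the GUI renders a graceful value instead of a bare "Error" toast.

```python
import functools
import logging

logger = logging.getLogger(__name__)

def safe_callback(fallback):
    """Wrap a GUI refresh callback: log exceptions, return a safe fallback."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                # Traceback goes to the terminal instead of being swallowed.
                logger.exception("Callback %s failed", fn.__name__)
                return fallback
        return wrapper
    return deco

@safe_callback(fallback=("0", "--", "--", "--"))
def refresh_dashboard():
    raise RuntimeError("boom")

print(refresh_dashboard())  # → ('0', '--', '--', '--')
```

The fallback value must match the callback's declared Gradio outputs in arity and type, which is why each of the six functions has its own tailored fallback tuple.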

Tested: Full 90-run experiment with oqmd_l12_compounds.csv (143 rows × 11 cols, target=delta_e, all workflows). All tabs loaded successfully after completion with zero errors in GUI and zero exceptions in terminal.

String/text column support as explanatory variables

User reported that element columns (e.g. element_A, element_B) written in alphabetic text could not be selected as features. Now:

  • Feature checkboxes show all columns (numeric + string), not just numeric
  • String columns are annotated with [文字列] tag in the UI
  • When string columns are selected, they are automatically one-hot encoded (pd.get_dummies) before being passed to the ML pipeline
  • Numeric columns continue to be used as-is with median NaN imputation
  • Default selection still pre-checks only numeric columns; string columns require manual selection
  • ID-like columns (integer, all unique, monotonically ordered) and constant columns (variance=0) are excluded from defaults using data-driven heuristics (no hardcoded column names)
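The default-selection heuristics and one-hot handling above can be sketched as follows. Function names are illustrative (the real app.py helpers may differ), but the rules match the list: string columns need manual opt-in, constant columns (variance = 0) and ID-like columns (integer, all unique, monotonically ordered) are excluded from defaults, and selected string columns go through pd.get_dummies.

```python
import pandas as pd

def is_id_like(s):
    # Data-driven heuristic: integer dtype, all unique, monotonically ordered.
    return (pd.api.types.is_integer_dtype(s) and s.is_unique
            and (s.is_monotonic_increasing or s.is_monotonic_decreasing))

def default_features(df, target):
    out = []
    for c in df.columns:
        if c == target:
            continue
        s = df[c]
        if not pd.api.types.is_numeric_dtype(s):
            continue                      # string cols need manual opt-in
        if s.nunique(dropna=True) <= 1:   # constant column (variance = 0)
            continue
        if is_id_like(s):
            continue
        out.append(c)
    return out

df = pd.DataFrame({"entry_id": [1, 2, 3, 4],
                   "natoms": [4, 4, 4, 4],
                   "volume": [10.2, 11.5, 9.8, 10.9],
                   "element_A": ["Al", "Ni", "Co", "Cr"],
                   "delta_e": [0.1, -0.2, 0.05, -0.1]})
print(default_features(df, target="delta_e"))  # → ['volume']
# If the user manually selects element_A, it is one-hot encoded:
X = pd.get_dummies(df[["volume", "element_A"]])
```

No column names are hardcoded: entry_id and natoms are excluded purely by their statistical properties, so the same logic works on any CSV.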

KeyError fix: _refresh_ood_data / _export_ood_csv in generic CSV mode

After a successful 540-run experiment, the final tab refresh crashed with:

KeyError: "None of [Index(['r_avg', 'delta_r', 'dS_mix', ...], dtype='str')] are in the [columns]"

Root cause: _refresh_ood_data() and _export_ood_csv() called FeatureCatalog.columns() which returns HEA-specific column names, but in generic CSV mode features_df has the actual CSV columns (including one-hot encoded columns). The FeatureCatalog._SETS monkey-patch in _run_in_thread had already been restored by the finally block before these functions ran.

Fix: Both functions now check session["csv_mode"] — in "generic" mode they use features_df.columns directly. Added available_cols safety filter to prevent KeyError even if some columns are missing.
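The available_cols safety filter amounts to intersecting the requested column list with what the DataFrame actually has (helper name is illustrative): a stale column list then degrades to a subset instead of raising KeyError.

```python
import pandas as pd

def select_available(features_df, wanted_cols):
    # Only request columns that actually exist; preserves requested order.
    available = [c for c in wanted_cols if c in features_df.columns]
    return features_df[available]

df = pd.DataFrame({"a": [1], "b": [2]})
sub = select_available(df, ["a", "r_avg", "delta_r"])
print(list(sub.columns))  # → ['a']
```

Direct indexing with the same list (df[["a", "r_avg", "delta_r"]]) would raise exactly the KeyError quoted above, since pandas rejects any label missing from the columns.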

fix_instructions.docx - 10 fixes applied (10件の修正適用)

Critical (重大)

  1. 2.4 KeyError fix for CSV generic mode (app.py, runner.py):

    • _run_in_thread() now detects csv_mode == "generic" and temporarily overrides FeatureCatalog._SETS so all feature sets use the CSV columns instead of HEA-specific columns. Original _SETS is restored in a finally block.
    • runner.py _phase3_train() now skips feature sets whose columns are missing from the data (logs a warning instead of crashing with KeyError).
  2. 2.1 Validity score weight documentation (user_manual.md §8.1):

    • Updated weights from (0.25, 0.20, 0.20, 0.15, 0.20) to (+0.30, +0.20, +0.30, −0.15, +0.20, −0.10) to match evaluation.py implementation.
  3. 2.2 5 components → 6 components (5要素→6要素) (user_manual.md, plotly_charts.py, report.py):

    • Added an MC Penalty (Multicollinearity Penalty) column to the validity score DataFrame and the markdown report table.
    • Updated all "5要素" (5-component) references to "6要素" (6-component) across the manual.
  4. 2.3 Feature set counts (user_manual.md §7.1):

    • FS_SIZE: 10→12 (added B_avg, Vm_avg to listed features).
    • FS_ALL: 16→18.

Medium (中程度)

  1. 3.1 Cr duplication comment (dataset.py):

    • Added clarifying comment explaining that Cr appears intentionally in both _POOL_FCC and _POOL_BCC (used in both Cantor-type FCC and refractory BCC alloys).
  2. 3.2 noise_std deprecation warning (dataset.py):

    • Removed the conditional check if noise_std != 50.0: — now always emits the deprecation warning regardless of the parameter value.
  3. 3.3 Windows resource module compatibility (runner.py):

    • Conditional import: try: import resource; _HAS_RESOURCE = True; except ImportError: _HAS_RESOURCE = False.
    • All resource.getrusage() calls now guarded by if _HAS_RESOURCE: to prevent crashes on Windows.
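Item 3.3's guard is the standard pattern for Unix-only stdlib modules; a minimal sketch (the peak_rss_kb helper is hypothetical, not a runner.py function):

```python
# The resource module exists only on Unix; import it behind a flag so the
# module still loads on Windows.
try:
    import resource
    _HAS_RESOURCE = True
except ImportError:
    _HAS_RESOURCE = False

def peak_rss_kb():
    """Peak resident set size, or None on platforms without resource."""
    if not _HAS_RESOURCE:
        return None
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
```

Callers must then tolerate a None result, which is why every getrusage() call site in runner.py is guarded by the same flag.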

Minor (軽微)

  1. 4.1 OOD fold-0 limitation note (user_manual.md §8.2):

    • Added note that OOD detection currently uses only fold-0 (design tradeoff for speed vs. computation cost).
  2. 4.2 MC Penalty column description (user_manual.md §5.4):

    • Updated report content table to list MC Penalty column.
  3. 4.3 WF-LASSO/ARD/RF workflows (user_manual.md §7.2):

    • Added WF-LASSO, WF-ARD, WF-RF to the workflow table (previously only WF-LIN/XGB/ENS were documented despite being implemented in earlier PRs).

Security & correctness fixes (Devin Review — prior revision)

  1. XSS fix — missing-column warning: Column names from CSV are now escaped with html_mod.escape() in the missing-value warning section of _analyze_csv_format().
  2. XSS fix — constant-column warning: Column names in the constant-column (variance=0) warning are now escaped.
  3. Data leakage prevention: _handle_generic_csv() now defensively removes the target column from feature_cols (feature_cols = [c for c in feature_cols if c != target_col_clean]) with an error message if the resulting list is empty.
  4. Traceback HTML escaping: _on_csv_upload() error handler wraps traceback.format_exc() output in <pre> with html_mod.escape() to prevent broken HTML from angle brackets like <module>.

Earlier fixes (still in this PR)

  1. Default target uses last numeric column (numeric_cols[-1]) instead of first — matches convention that target is typically the last column.
  2. _on_target_change caches numeric columns in gr.State — no re-reading CSV on target change, no dtype inconsistency from nrows=5.
  3. Numeric dtype filter on feature checkboxes via pd.api.types.is_numeric_dtype().
  4. Docstring return type for _analyze_csv_format corrected to Tuple[pd.DataFrame, str, List[str], List[str]].

Other

  1. GUI version tag updated from PR#126 to PR#137.

GUI Testing

Test: oqmd_l12_compounds.csv pipeline architecture verification (143 rows × 11 columns):

  • CSV書式確認 correctly shows 8 numeric / 3 string columns
  • Constant column warning displayed for natoms (all values = 4)
  • Target changed to delta_e; selected lattice_constant, stability, volume, element_A [文字列], element_B [文字列] as features
  • "CSV読み込み" processed in generic mode (143 samples)
  • Full experiment: 90 runs (WF-LIN, WF-LASSO, WF-ARD, WF-RF, WF-XGB, WF-ENS × 3 splits × 5 folds) completed in 52.4 sec
  • Terminal logs confirmed all 4 pipeline fixes:
    • Fix 1: Effective columns [FS_ALL]: 77 features (after Phase 0 drops) — dropped 2 perfect-collinear element columns
    • Fix 2: Feature selection [FS_ALL]: 77 → 13 features — Lasso reduced feature count
    • Fix 3: No scaler or pca steps logged for WF-RF/WF-XGB (tree models)
    • Fix 4: XGB best params include expanded grid params (subsample, etc.)
  • Zero "エラー" messages in GUI
  • Zero exceptions in terminal logs (only INFO/WARNING)
  • All tabs loaded successfully: Dashboard, Results, OOD Map, Report
  • Results tab shows parity plots for all 6 workflows
  • OOD detection completed without KeyError: 1/29 samples flagged (3.4%)

(Screenshots: CSV Format Analysis; Data Summary - Generic Mode)

Review & Testing Checklist for Human

Risk Level: 🔴 RED - Significant architectural changes + OOD ensemble algorithm change may impact reproducibility and model performance.

  • OOD per-sample accumulation correctness - The OOD ensemble fix (commit 72abb15) replaced element-wise averaging with per-sample accumulation. Verify the logic is correct: scores should be mapped back to global sample indices (score_sum[te_idx] += res.composite_scores), then averaged only where samples were scored by multiple folds. Test with a multi-seed experiment and verify OOD flags are consistent across runs with the same data.
  • Generalisation score formula change - Fix #7 changed from min(1.0, 0.5 + 0.5 * geo_mean) to min(1.0, geo_mean). This will change evaluation results for all feature sets. Compare validity scores before/after on a reference dataset to understand the impact. Verify the new formula makes sense (no artificial baseline).
  • ValidityScore configurable weights - Fix #6 added a _weights dict field to the ValidityScore dataclass. Verify this doesn't leak into DataFrame exports or JSON serialization. Check that the to_dataframe() and to_dict() methods don't include _weights.
  • Binary enthalpy JSON loading - Fix #11 externalized _DELTA_H_BINARY to JSON. Verify the lazy loading works (_load_delta_h() called on first get_binary_enthalpy()). Check that tuple-key conversion from comma-separated strings works correctly. Test with an HEA dataset to ensure dH_mix features are calculated correctly.
  • Dynamic DataFrame columns - Fix #8 uses dataclasses.fields() to dynamically generate column names. Verify to_dataframe() includes exactly the expected scalar fields (workflow, feature_set, split_policy, seed, fold, rmse_*, mae_*, r2_*, elapsed_sec) and excludes array/dict fields (y_test_true, y_test_pred, test_indices, params, artifacts, _weights).
  • Pipeline architecture fixes - Verify all 4 pipeline fixes work correctly: (1) Phase 0 drops reflected in Phase 3, (2) Lasso feature selection runs in Phase 0.5, (3) tree models don't use StandardScaler/PCA, (4) expanded HPO grids don't crash. Run a full experiment with HEA_ml_numeric_highconf.csv and check terminal logs.
  • AIC/BIC skip threshold - Fix #4 skips AIC/BIC when n_features > 30. This is an arbitrary threshold. Test with datasets at the boundary (e.g., 29, 30, 31 features) to verify the skip logic works and doesn't cause issues.

Test Plan

  1. Start GUI: python -m hea_extrapolation_platform gui
  2. Test 12 code review improvements:
    • Upload a CSV with high dimensionality (e.g., oqmd_l12_compounds.csv after one-hot encoding should have 70+ features)
    • Run full experiment (not Quick Mode) to test all 12 fixes
    • Check terminal logs for:
      • Lazy workflow instantiation (no upfront "Loading WF-..." messages)
      • OOD ensemble: "OOD FS_ALL: X/Y flagged (ensemble over N folds)" with N > 1
      • Feature selection: "Feature selection [FS_ALL]: M → K features" with K < M
      • AIC/BIC skip: "AIC skipped: N features > threshold 30" if applicable
    • Verify Results tab shows all 6 workflows (LIN, LASSO, ARD, XGB, ENS, RF)
    • Verify validity scores include MC Penalty column
  3. Test OOD ensemble bug fix:
    • Run an experiment with multiple seeds (--seeds 42,123,456)
    • Check that OOD composite scores are averaged correctly (not element-wise across different sample sets)
    • Verify OOD Map tab loads without errors
  4. Test backward compatibility:
    • Upload HEA_ml_numeric_highconf.csv (HEA mode)
    • Verify HEA mode auto-detected and experiment runs successfully
    • Check dH_mix feature values are correct (verify binary enthalpy JSON loading)
  5. Test binary enthalpy JSON loading:
    • Check terminal logs for "Loaded 179 binary enthalpy entries from data/delta_h_binary.json" message
    • If warning appears, investigate why JSON loading failed
  6. Test ValidityScore weights:
    • Export run registry JSON and verify _weights field is NOT present
    • Check validity score DataFrame doesn't include _weights column
  7. Test generalisation score change:
    • Compare validity scores with a baseline dataset from a previous version
    • Verify scores changed as expected (should generally be higher with new formula)

Notes

  • Link to Devin Session: https://app.devin.ai/sessions/bca8c5188d2c4cd58e2182d85a0e19ad
  • Requested by: @minimumtone
  • Reproducibility warning: The OOD ensemble algorithm change (Fix #2 + commit 72abb15) will produce different OOD detection results compared to previous versions, even with the same random seed. This is expected and correct (the old algorithm was buggy).
  • Evaluation score changes: Fix #7 (generalisation score formula) will change validity score rankings. Re-run reference experiments to establish new baselines.
  • Binary enthalpy externalization: The JSON file uses comma-separated string keys (e.g., "Co,Cr") instead of tuple keys. The conversion happens at load time via tuple(k.split(",")). Verify this works correctly for all 179 entries.
  • Dynamic DataFrame columns risk: The to_dataframe() method filters fields by checking isinstance(getattr(self._runs[0], name, None), (np.ndarray, dict)). This assumes the first run is representative of all runs. If runs have heterogeneous field types (unlikely but possible), some columns may be incorrectly included/excluded.
  • AIC/BIC threshold arbitrariness: The n_features > 30 threshold is a heuristic. Datasets with 31-50 features may still benefit from AIC/BIC selection but will now skip it. Consider making this threshold configurable in a future PR.
  • XGB fallback conditional: The expanded HPO grid conditionally adds XGBoost-specific params only when _XGB_AVAILABLE is True. This prevents crashes on GradientBoostingRegressor fallback, but the fallback itself may have other parameter incompatibilities (e.g., subsample may not work identically). Test on a system without XGBoost if possible.
  • 12 improvements risk summary: Fixes 1-4 are high-impact changes to core pipeline logic. Fixes 5-9 are refactoring/design improvements with lower risk. Fixes 10-12 are documentation/organization with minimal risk. Prioritize testing high-impact fixes.


- Add _analyze_csv_format(): detects encoding, dtypes, missing values,
  displays per-column analysis table with warnings
- Add _handle_generic_csv(): supports non-HEA CSVs by using user-selected
  numeric columns directly as features (bypasses HEA feature computation)
- Add CSV書式確認 accordion: auto-opens on upload showing format report
- Add 説明因子・目的変数の選択 accordion: Dropdown for target variable,
  CheckboxGroup for explanatory variables, Radio for CSV mode (auto/HEA/Generic)
- Update _handle_csv_upload() to accept selected_features and force_generic params
- Update run_experiment() to pass column role selections to pipeline
- Auto-detect: numeric columns suggested as features, first numeric as target
- Target change auto-excludes from feature list

Tested with both HEA CSV (element columns) and B2_result CSV (generic numeric).

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
@devin-ai-integration
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.


… hint

- _on_target_change now uses pd.api.types.is_numeric_dtype() to filter
  columns, preventing string/datetime columns from appearing in feature list
- _analyze_csv_format docstring return type corrected to match actual signature

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
devin-ai-integration[bot]

This comment was marked as resolved.

…n gr.State

- Fix default_target to use numeric_cols[-1] (last column) instead of [0]
- Store numeric_cols in gr.State during upload to avoid re-reading CSV
- _on_target_change now uses cached list for consistent dtype detection

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>

When file_obj is None, return empty [] for csv_numeric_cols_state
to match the 7 declared outputs in the .change() binding.

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>

devin-ai-integration bot and others added 2 commits March 5, 2026 08:29
…banner

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>

…lumns

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>

devin-ai-integration bot and others added 2 commits March 9, 2026 14:30
…ation error

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…Windows compat

- 2.4 [Critical] Fix KeyError in CSV generic mode: override FeatureCatalog._SETS
  with CSV columns in app.py _run_in_thread(); skip missing FS in runner.py
- 2.1 [Critical] Update user_manual.md section 8.1: weights now match
  implementation (0.30/0.20/0.30/-0.15/0.20/-0.10)
- 2.2 [Critical] Update 5要素→6要素 across user_manual.md, add MC Penalty
  column to plotly_charts.py and report.py
- 2.3 [Critical] Fix FS_SIZE count (10→12) and FS_ALL (16→18) in docs
- 3.1 [Medium] Add clarifying comment for intentional Cr duplication in
  dataset.py element pools
- 3.2 [Medium] Remove conditional check on noise_std deprecation warning —
  always warn regardless of value
- 3.3 [Medium] Conditional import of resource module for Windows compatibility
  in runner.py with _HAS_RESOURCE guard
- 4.1 [Minor] Add OOD fold-0 limitation note to user_manual.md section 8.2
- 4.2 [Minor] Add MC Penalty to section 5.4 report content table
- 4.3 [Minor] Add WF-LASSO/ARD/RF to user_manual.md section 7.2 workflow table

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
@devin-ai-integration devin-ai-integration bot changed the title feat: CSV書式確認 & 説明因子・目的変数の選択機能 feat: CSV書式確認 & 説明因子・目的変数の選択機能 + fix_instructions 10件修正 Mar 9, 2026
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 2 new potential issues.

View 14 additional findings in Devin Review.


Comment on lines +238 to +244
warnings.warn(
    "noise_std is deprecated and ignored; strength noise is now "
    "proportional (7 % CV). This parameter will be removed in a "
    "future release.",
    DeprecationWarning,
    stacklevel=2,
)

🟡 Deprecation warning fires unconditionally on every call, not just when parameter is explicitly set

The old code only emitted the DeprecationWarning for noise_std when the caller explicitly passed a non-default value (if noise_std != 50.0). The new code removes this guard and warns on every call to generate_hea_dataset(), even when the caller uses the default. Both call-sites in the codebase (hea_extrapolation_platform/__main__.py:99 and hea_extrapolation_platform/gui/app.py:2837) never pass noise_std, so they will now emit a spurious deprecation warning on every dataset generation, cluttering logs and misleading users into thinking they are using a deprecated API incorrectly.

Suggested change:

if noise_std != 50.0:
    warnings.warn(
        "noise_std is deprecated and ignored; strength noise is now "
        "proportional (7 % CV). This parameter will be removed in a "
        "future release.",
        DeprecationWarning,
        stacklevel=2,
    )

Comment on lines +2949 to +2962
_orig_sets = dict(_FC._SETS)
try:
    csv_cols = list(features_df.columns)
    _FC._SETS = {k: csv_cols for k in _FC._SETS}
    r, s, o = runner.run(
        features_df, features_df, target,
        progress_callback=_progress_cb,
        selected_workflows=selected_wfs,
    )
finally:
    _FC._SETS = _orig_sets
Contributor

@devin-ai-integration devin-ai-integration bot Mar 9, 2026


🔴 Thread-unsafe mutation of shared FeatureCatalog._SETS in generic CSV mode

FeatureCatalog._SETS is a class-level attribute shared across all threads and users. In generic CSV mode, _run_in_thread (gui/app.py:3033-3046) mutates this shared dict by replacing all entries with the current CSV's columns and restores it in a finally block. If two users run generic CSV experiments concurrently (this Gradio app explicitly supports multi-user sessions via gr.State and app.queue()), one user's override can be read by another user's runner, or one user's restore can clobber another user's override, leading to incorrect feature sets and potential KeyError crashes.

Race condition sequence

Thread A overrides _FC._SETS with columns [a, b, c]
Thread B overrides _FC._SETS with columns [x, y, z]
Thread A's runner.run() reads _SETS → gets [x, y, z] instead of [a, b, c]
Thread A's finally restores _orig_sets (overwriting Thread B's override)
Thread B's runner.run() reads _SETS → gets original HEA sets instead of [x, y, z]


Contributor


Acknowledged — this is a valid concern for multi-user deployments. For the current single-user desktop use case this is not a blocker, but noted for future refactoring if concurrent access becomes a requirement.

Contributor


Acknowledged. This is a valid concern for concurrent multi-user scenarios. However, the current deployment is single-user (local Gradio app), and _SETS mutation is scoped within try/finally with immediate restore. If multi-user support becomes a requirement, a thread-local or per-request copy pattern would be appropriate. Deferring this to a future PR unless the maintainer wants it addressed now.

@devin-ai-integration
Contributor

Devin Review raised 3 items. Here's the status:

  1. Empty feature list [] → None: Already fixed in commit c9d6cd8 (selected_features if selected_features is not None else []). The comment is marked "Outdated".

  2. noise_std unconditional warning: This change was explicitly requested by the fix_instructions document (item 3.2: "Remove condition, always warn about deprecated parameter"). The bot suggests reverting to the conditional check, but this contradicts the user's instructions.

  3. Race condition on FeatureCatalog._SETS: Valid theoretical concern, but the Gradio GUI is single-user (queue serializes requests). The fix_instructions recommended approach B (backup/restore), which is what was implemented. The finally block ensures restoration even on exceptions. For a multi-user deployment, this would need a threading.Lock or the override_feature_columns parameter approach suggested by the bot.

No code changes made — all items are either already addressed or are bot suggestions that don't apply to the current single-user use case.

devin-ai-integration bot and others added 2 commits March 9, 2026 14:54
…eature selection

- _analyze_csv_format now returns constant_cols (variance=0) as 5th element
- _on_csv_upload auto-deselects constant columns (count_A, count_B etc.)
  and ID-like columns (entry_id, index, etc.) from default features
- Users can still manually select these columns if needed
- Feature choices still show all numeric columns except target

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…ection

- Remove hardcoded _id_like set (entry_id, index, etc.)
- Detect ID-like columns by statistical properties:
  integer type + all unique values + monotonically ordered
- Also detect constant columns (variance=0) and all-NaN columns
- Fully generic: works with any CSV regardless of column naming

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
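The statistical detection described in this commit could be sketched as follows. This is an illustration only; `detect_excluded_columns` is a hypothetical name, not the function in the codebase:

```python
import pandas as pd


def detect_excluded_columns(df: pd.DataFrame):
    """Flag columns that make poor default features, by statistics alone."""
    constant, id_like, all_nan = [], [], []
    for col in df.columns:
        s = df[col]
        if s.isna().all():
            all_nan.append(col)              # nothing to learn from
        elif s.nunique(dropna=True) <= 1:
            constant.append(col)             # variance == 0
        elif (pd.api.types.is_integer_dtype(s)
              and s.is_unique
              and (s.is_monotonic_increasing or s.is_monotonic_decreasing)):
            id_like.append(col)              # looks like a row identifier
    return constant, id_like, all_nan
```

Because the checks are purely statistical, a column named anything at all is caught as long as it behaves like an ID, a constant, or an empty column.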

devin-ai-integration bot and others added 2 commits March 9, 2026 14:59
…se3_train

When a feature set is skipped due to missing columns, jobs referencing
that feature set remained in the list, causing KeyError at fs_arrays[job.fs_name].
Now filter jobs to only those with successfully prepared feature sets.

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…ding

- Feature checkboxes now show ALL columns (numeric + string)
- String columns annotated with [文字列] tag for easy identification
- String columns auto-encoded via pd.get_dummies (One-Hot Encoding)
- Numeric columns used as-is with median NaN imputation
- Updated UI labels and help text to reflect generic column selection
- Target variable dropdown remains numeric-only
- Fully generic: works with any CSV regardless of column types

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
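The encoding path described in this commit might look roughly like this sketch; `build_feature_matrix` is an illustrative name, not the actual function in app.py:

```python
import pandas as pd


def build_feature_matrix(df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """One-hot encode string columns, median-impute numeric NaNs."""
    X = df[feature_cols]
    numeric = X.select_dtypes(include="number")
    strings = X.select_dtypes(exclude="number")
    numeric = numeric.fillna(numeric.median())        # median NaN imputation
    if strings.empty:
        return numeric
    dummies = pd.get_dummies(strings, dtype=float)    # one-hot encoding
    return pd.concat([numeric, dummies], axis=1)
```

A string column `phase` with values `fcc`/`bcc` would become two 0/1 columns `phase_fcc` and `phase_bcc`, which any downstream regressor can consume.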

devin-ai-integration bot and others added 3 commits March 10, 2026 21:36
In generic CSV mode, FeatureCatalog.columns() returns HEA-specific
column names (r_avg, delta_r, etc.) that don't exist in the CSV
features_df. Now checks session['csv_mode'] and uses features_df.columns
directly when in generic mode, falling back to FeatureCatalog only for
HEA mode. Also adds safety filter for available columns.

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…radio errors

All module-level refresh functions (_refresh_dashboard_data,
_refresh_results_data, _refresh_ood_data, _refresh_data_summary,
_refresh_report_data, _export_ood_csv) now catch exceptions internally
and return safe fallback values instead of propagating to Gradio.

This prevents the 'Error' toast that appears in the GUI when any
post-experiment callback fails — errors are now logged to the terminal
via logger.exception() while the UI gracefully degrades.

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…neric CSV runs only FS_ALL

- workflows.py: WorkflowENS._make_member now accepts n_features param (was hardcoded 132)
- runner.py: ExperimentRunner.run() accepts selected_feature_sets to filter feature sets
- app.py: generic CSV mode passes selected_feature_sets=['FS_ALL'] — no redundant FS comparison

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
devin-ai-integration bot and others added 2 commits March 11, 2026 13:51
Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…ing and user_manual.md TOC

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>

devin-ai-integration bot and others added 2 commits March 11, 2026 14:03
…y docstring enumeration

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
… pipeline, remove StandardScaler+PCA from tree models, expand HPO grids

- Phase 0 dropped columns (constant/collinear) now reflected in Phase 3 & 4
- Lasso feature selection integrated as Phase 0.5 (removes uninformative features)
- WF-XGB/WF-RF: removed StandardScaler+PCA (trees are scale-invariant, PCA destroys physical axes)
- WF-ENS: tree-based members no longer use scaler/PCA
- WF-XGB HPO grid expanded: +subsample, colsample_bytree, min_child_weight
- WF-RF HPO grid expanded: +min_samples_leaf, max_features

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
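The scaler/PCA split by model family could be wired roughly like this sketch. The names (`TREE_WORKFLOWS`, `build_pipeline`, the PCA variance threshold) are assumptions for illustration, not the actual workflow code:

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

TREE_WORKFLOWS = {"WF-XGB", "WF-RF"}


def build_pipeline(name, estimator):
    # Trees split on thresholds, so monotone rescaling changes nothing,
    # and PCA would rotate away physically meaningful feature axes.
    if name in TREE_WORKFLOWS:
        return make_pipeline(estimator)
    # Linear/kernel members still benefit from standardisation (+ PCA).
    return make_pipeline(StandardScaler(), PCA(n_components=0.95), estimator)
```

Keeping the raw feature axes for tree models also makes their feature importances directly interpretable in physical terms.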

…ostingRegressor fallback

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
@devin-ai-integration devin-ai-integration bot changed the title feat: CSV書式確認 & 説明因子・目的変数の選択機能 + fix_instructions 10件修正 feat: CSV書式確認 & 説明因子・目的変数の選択機能 + pipeline architecture fixes Mar 13, 2026
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.

View 20 additional findings in Devin Review.


Comment on lines +478 to +492
_X_fs = features_all[_cols]
_fs_summary = run_feature_selection(
    _X_fs, target,
    methods=["Lasso"],
    feature_set=_fs_key,
)
_lasso = _fs_summary.results.get("Lasso")
if _lasso and _lasso.selected_features:
    _sel = _lasso.selected_features
    if len(_sel) >= 2:
        logger.info(
            "Feature selection [%s]: %d → %d features",
            _fs_key, len(_cols), len(_sel),
        )
        self._effective_cols[_fs_key] = _sel
Contributor

🔴 Feature selection (Phase 0.5) uses full dataset including test data, causing data leakage

Phase 0.5 (runner.py:478-492) runs run_feature_selection on features_all and target, which is the entire dataset — not just training data. The Lasso model in _run_lasso (feature_selection.py:148-149) fits on all samples including future test samples from every cross-validation fold. The features selected this way are then stored in self._effective_cols and used for all Phase 3 training runs and Phase 4 OOD detection. This means feature selection has seen the test targets, invalidating the cross-validation evaluation. Test metrics (RMSE, R², etc.) and validity scores will be optimistically biased because the feature set was chosen with knowledge of test data. This undermines the platform's core purpose of comparing feature set validity and extrapolation capability.

Prompt for agents
In hea_extrapolation_platform/runner.py, the Phase 0.5 feature selection at lines 469-497 calls run_feature_selection(features_all[_cols], target, ...) using the entire dataset. This introduces data leakage because the Lasso model sees test targets. To fix this, either: (1) Remove Phase 0.5 entirely and rely on the learning algorithms' built-in regularization (Lasso, ARD workflows already do feature selection), or (2) Perform feature selection only on a held-out portion of the data that is never used for evaluation — e.g., split off a small portion before Phase 1 folds are computed, or use only the training fold from the first RandomCV split. Option (1) is the safest and simplest fix. If option (2) is chosen, ensure the feature selection split is computed before Phase 1 fold_plan and excluded from all subsequent folds.

Contributor

Valid point about data leakage. Phase 0.5 currently runs run_feature_selection on features_all (entire dataset) before CV splitting, so the Lasso sees test targets.

Options to fix:

  1. Remove Phase 0.5 entirely — rely on built-in regularization (WF-LASSO, WF-ARD already do L1/ARD selection). Simplest and safest.
  2. Move feature selection inside Phase 3 — run Lasso on each training fold only. Most correct but increases computation (N_folds × N_seeds runs of Lasso).
  3. Use a held-out pre-split — reserve a small portion before fold computation for feature selection only.

Leaving this for the reviewer to decide which approach is preferred, as this involves a design tradeoff between computational cost and statistical rigor. The current implementation was requested by the user as "特徴量選択を学習本体に統合すること" — integrating feature selection into the pipeline.
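Option 2 could be sketched in plain numpy as below, using a simple |correlation| ranking as a stand-in for the actual Lasso step; all names here are illustrative:

```python
import numpy as np


def select_features_per_fold(X, y, folds, top_k=2):
    """Rank features using ONLY each fold's training rows, so the
    selection never sees test targets (no leakage)."""
    per_fold = []
    for tr_idx, _te_idx in folds:
        Xtr, ytr = X[tr_idx], y[tr_idx]
        # importance proxy: absolute correlation with the target,
        # computed exclusively on training rows
        corr = np.array([abs(np.corrcoef(Xtr[:, j], ytr)[0, 1])
                         for j in range(X.shape[1])])
        per_fold.append(list(np.argsort(-corr)[:top_k]))
    return per_fold
```

The cost is one selection run per fold, which is the computational tradeoff mentioned above.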

HIGH priority (fixes 1-4):
- #1: Lazy workflow instantiation via lambda factories in _run_job
- #2: Phase 4 OOD multi-fold ensemble across all seeds
- #3: O(1) hash dict lookup in _collect_ood_errors (replaces O(n) np.array_equal)
- #4: AIC/BIC forward stepwise guard for n_features > 30

MEDIUM priority (fixes 5-9):
- #5: Consolidate safe_array/_safe_np into _utils.py (single source of truth)
- #6: Configurable weights for ValidityScore.total via FeatureValidityEvaluator(weights=...)
- #7: Fix generalisation score formula: min(1.0, geo_mean) instead of min(1.0, 0.5+0.5*geo_mean)
- #8: Dynamic column names in RunRegistry.to_dataframe via dataclasses.fields()
- #9: Initialize OODDetector._actual_k in __init__ (not just fit())

LOW priority (fixes 10-12):
- #10: Unified mlflow/feast/mint interface (passing object implies use)
- #11: Externalize _DELTA_H_BINARY to data/delta_h_binary.json
- #12: Add Phase 0.5 documentation to user_manual.md

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 2 new potential issues.

View 23 additional findings in Devin Review.


Contributor

🔴 NameError: wf_map referenced but variable was renamed to _BUILTIN_FACTORIES

In the _run_job function, the error message on line 263 references wf_map, but the variable was renamed to _BUILTIN_FACTORIES on line 241. When an unknown workflow name is encountered (not built-in and not in mint_configs), the raise KeyError will itself raise a NameError: name 'wf_map' is not defined, masking the original informative error message with a confusing traceback.

(Refers to line 263)


Contributor

Fixed in 72abb15: wf_map → _BUILTIN_FACTORIES in the error message.

Comment on lines +882 to +896
if len(fold_composites) > 1:
    avg_composite = np.mean(fold_composites, axis=0)
    avg_threshold = float(np.quantile(avg_composite, 0.95))
    is_ood_avg = avg_composite > avg_threshold
    n_ood = int(is_ood_avg.sum())
    ood_res = OODResult(
        mahalanobis_scores=primary_res.mahalanobis_scores,
        knn_scores=primary_res.knn_scores,
        composite_scores=avg_composite,
        is_ood=is_ood_avg,
        ood_threshold=avg_threshold,
        ood_ratio=n_ood / max(len(avg_composite), 1),
        n_total=len(avg_composite),
        n_ood=n_ood,
    )
Contributor

🔴 OOD ensemble averages composite scores across different sample sets

In _phase4_ood, each seed's fold-0 produces a different test set (different sample indices due to different random shuffles). The code collects composite scores from each fold into fold_composites and averages them element-wise with np.mean(fold_composites, axis=0) at runner.py:883. However, position i in each array corresponds to a different sample in each fold, so the element-wise average is meaningless — it averages OOD scores of unrelated samples. The resulting avg_composite is then used to determine is_ood and stored with the primary fold's test indices (ood_test_idx from all_ood_folds[0]), producing incorrect OOD classifications. The _collect_ood_errors call at line 912-913 also uses ood_test_idx from fold 0, so the is_ood mask (computed from averaged scores of mixed samples) is applied to fold-0's test samples, yielding wrong OOD error evaluations.

Prompt for agents
In hea_extrapolation_platform/runner.py, the _phase4_ood method (lines 806-923) attempts to ensemble OOD scores across multiple RandomCV folds from different seeds, but each seed produces a different test set (different sample indices). Element-wise averaging of composite_scores arrays is therefore meaningless since position i in each array refers to a different sample.

Fix options:
1. Revert to single-fold OOD (use only fold 0 of seed 0, as was done before this PR).
2. If ensemble OOD is desired, map scores back to a per-sample array indexed by global sample index (e.g. create an array of shape (n_total_samples,) and accumulate scores per sample across folds, then average only where a sample was scored by multiple folds).
3. Alternatively, only use folds that share the same test indices (which won't happen with different seeds).

Contributor

Good catch. Fixed in 72abb15: replaced element-wise averaging with per-sample accumulation using global score_sum/score_count arrays indexed by original sample indices. Each fold's scores are mapped back to global indices via score_sum[te_idx] += res.composite_scores, then averaged only where samples overlap (score_sum[primary_te] / score_count[primary_te]). This ensures the averaging is meaningful even when different seeds produce different test sets.
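The per-sample accumulation described here can be sketched as follows (illustrative function name; the real code lives in _phase4_ood):

```python
import numpy as np


def ensemble_ood_scores(n_samples, folds):
    """Average composite OOD scores per sample across folds whose test
    sets differ; a sample is averaged only over folds that scored it."""
    score_sum = np.zeros(n_samples)
    score_count = np.zeros(n_samples)
    for te_idx, scores in folds:
        score_sum[te_idx] += scores      # map scores back to global indices
        score_count[te_idx] += 1
    avg = np.full(n_samples, np.nan)     # NaN where no fold scored a sample
    scored = score_count > 0
    avg[scored] = score_sum[scored] / score_count[scored]
    return avg
```

Position i now always refers to the same sample regardless of which folds happened to include it, which is what makes the average meaningful.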

…cumulation

- Fix NameError: error message referenced deleted 'wf_map' variable,
  now correctly references '_BUILTIN_FACTORIES'
- Fix OOD ensemble: different seeds produce different test sets, so
  element-wise averaging was meaningless. Now uses per-sample score
  accumulation via global arrays (score_sum/score_count) indexed by
  original sample indices, averaging only where samples overlap.

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.

View 21 additional findings in Devin Review.


# (Review: min causes asymmetric improvements to be undervalued)
  geo_mean = math.sqrt(rand_improve * block_improve)
- return min(1.0, 0.5 + 0.5 * geo_mean)
+ return min(1.0, geo_mean)
Contributor

🔴 Generalisation score formula produces unreasonably low values after removing base offset

The old formula min(1.0, 0.5 + 0.5 * geo_mean) mapped realistic improvement ratios to the [0.5, 1.0] range. The new formula min(1.0, geo_mean) maps them to [0, 1.0] without any offset. Since geo_mean = sqrt(rand_improve * block_improve) where improvements are relative RMSE reductions (typically 0.01–0.20), the geometric mean is very small. For example, if both splits improve RMSE by 5%, geo_mean = sqrt(0.05 * 0.05) = 0.05 — giving a generalisation score of 0.05 instead of the old 0.525. Since generalisation has the highest weight (0.30), this deflates total validity scores for all non-baseline feature sets, making the ranking less discriminative. The docstring states "both splits improve -> 1" but achieving near-1.0 now requires ~100% RMSE reduction, which is unrealistic.

Suggested change:
- return min(1.0, geo_mean)
+ return min(1.0, 0.5 + 0.5 * geo_mean)

Contributor

This change was explicitly requested in the user's code review (Fix #7): min(1.0, geo_mean) instead of min(1.0, 0.5 + 0.5 * geo_mean). The bot's analysis is technically correct that typical RMSE improvements (1-20%) will produce low generalisation scores, but the user's intent was to remove the artificial 0.5 baseline bias. If the scores are too deflated in practice, the user can adjust the formula or the weight in _DEFAULT_WEIGHTS (now configurable via Fix #6). Keeping as-is per user's explicit request.
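To make the tradeoff concrete, here is a minimal sketch reproducing both formulas and the 5%-improvement example from the review (the function name and `base_offset` flag are illustrative, not the actual API):

```python
import math


def generalisation_score(rand_improve, block_improve, base_offset=False):
    # geo_mean: geometric mean of the two relative RMSE improvements
    geo_mean = math.sqrt(rand_improve * block_improve)
    if base_offset:
        return min(1.0, 0.5 + 0.5 * geo_mean)  # old formula, range [0.5, 1.0]
    return min(1.0, geo_mean)                   # new formula, range [0.0, 1.0]
```

With 5% improvement on both splits the new formula yields 0.05 where the old one gave 0.525, matching the numbers in the review; the configurable weights from Fix #6 are the intended lever if that deflation proves too strong in practice.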


Development

Successfully merging this pull request may close these issues.

GraphRAG アプリケーションのPythonファイルダウンロード機能
