
feat: CSV format validation & explanatory/target variable selection (CSV書式確認 & 説明因子・目的変数の選択機能) + pipeline architecture fixes #137

Open
minimumtone wants to merge 24 commits into main from
devin/1772697677-csv-validation-column-roles

Conversation

Owner

@minimumtone minimumtone commented Mar 5, 2026

feat: CSV format validation & explanatory/target variable selection (CSV書式確認 & 説明因子・目的変数の選択機能) + pipeline architecture fixes

Summary

Adds CSV format validation and interactive column role assignment to support diverse CSV formats beyond HEA composition data. Also addresses 4 fundamental pipeline architecture gaps: Phase 0→Phase 3 column propagation, feature selection integration, tree model preprocessing cleanup, and hyperparameter search expansion.

New functions:

  • _analyze_csv_format(): Analyzes uploaded CSV structure (dtypes, missing values, constant columns) and renders an HTML report table
  • _handle_generic_csv(): Handles non-HEA CSVs by using user-selected columns directly as features (bypasses compute_features()). Supports both numeric and string columns — string columns are automatically one-hot encoded via pd.get_dummies.

New UI in Config & Run tab:

  • CSV書式確認 ("CSV format check") accordion: Auto-opens on upload, showing per-column analysis (type, non-null count, sample values, warnings)
  • 説明因子・目的変数の選択 ("select explanatory/target variables") accordion: Dropdown for the target variable, CheckboxGroup for explanatory variables, Radio for CSV mode (auto/HEA/Generic)
  • Changing the target column auto-excludes it from the feature checkbox list
  • Feature checkboxes show all columns (numeric + string); string columns are annotated with a [文字列] ("string") tag for easy identification

Modified:

  • _handle_csv_upload() now accepts selected_features and force_generic params; auto-detects HEA vs Generic mode
  • run_experiment() passes column role selections through to the pipeline
  • Old hardcoded run_csv_target Textbox replaced with dynamic Dropdown populated from CSV columns
  • _refresh_ood_data() and _export_ood_csv() now detect generic CSV mode and use features_df.columns directly instead of HEA-specific FeatureCatalog.columns()

Updates since last revision

12 code review improvements (commit 158d67d)

Applied 12 code improvements based on a structured code review, prioritized by impact level.

HIGH priority (Fixes 1–4): Reproducibility & Performance

  1. runner.py - Lazy workflow instantiation: _run_job() replaced upfront instantiation of all 6 workflows with lambda factories (_BUILTIN_FACTORIES). Workflows are now instantiated on demand, only when selected.
  2. runner.py - OOD multi-fold ensemble: Phase 4 OOD detection now uses per-sample score accumulation across all seeds' fold-0 instead of only the first seed. Scores are mapped back to global sample indices via score_sum[te_idx] += res.composite_scores, then averaged only where samples overlap. This prevents bias from a single fold/seed.
  3. runner.py - O(1) OOD error lookup: _collect_ood_errors() replaced the O(n) linear np.array_equal() scan with an O(1) hash dict lookup (test_indices.tobytes() as key).
  4. feature_selection.py - AIC/BIC n_features guard: Forward stepwise selection is skipped when n_features > 30 to avoid an O(n²) bottleneck on high-dimensional feature sets (e.g., FS_MAGPIE with 132 features).
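Fix 3's keyed lookup can be sketched as follows. This is an illustrative reconstruction, not the actual _collect_ood_errors() code: helper names and data shapes are assumptions. Fold results are stored in a dict keyed by the raw bytes of each test-index array, so matching a fold becomes a single hash lookup instead of an np.array_equal() scan over all folds.

```python
import numpy as np

def build_error_index(fold_results):
    """fold_results: iterable of (test_indices, errors) pairs (toy stand-in)."""
    # Key by the raw bytes of the index array: hashable, exact, O(1) to look up.
    return {idx.tobytes(): err for idx, err in fold_results}

def lookup_errors(index, test_indices):
    # O(1) hash lookup replaces an O(n) linear np.array_equal() scan.
    return index.get(test_indices.tobytes())

folds = [(np.array([0, 2, 5]), np.array([0.1, 0.3, 0.2])),
         (np.array([1, 3, 4]), np.array([0.2, 0.4, 0.1]))]
idx = build_error_index(folds)
print(lookup_errors(idx, np.array([1, 3, 4])))  # → [0.2 0.4 0.1]
```

Note the caveat inherent in this scheme: the key is dtype- and order-sensitive, so the lookup array must be produced the same way as the stored one.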

MEDIUM priority (Fixes 5–9): Design & Maintainability

  5. _utils.py (new), runner.py, workflows.py - Consolidated utility functions: Identical safe_array()/_safe_np() functions merged into a new _utils.py module. Single source of truth for pandas→numpy C-contiguous conversion.
  6. evaluation.py - Configurable ValidityScore weights: Hardcoded weights (0.30, 0.20, 0.30, −0.15, 0.20, −0.10) moved to a _DEFAULT_WEIGHTS dict. FeatureValidityEvaluator(weights={...}) now accepts custom weights.
  7. evaluation.py - Generalisation score formula fix: Changed from min(1.0, 0.5 + 0.5 * geo_mean) to min(1.0, geo_mean) to remove the artificial 0.5 baseline bias.
  8. runner.py - Dynamic DataFrame columns: RunRegistry.to_dataframe() now uses dataclasses.fields(RunResult) instead of a hardcoded column list, automatically excluding non-scalar fields (ndarray, dict).
  9. ood.py - OODDetector init robustness: _actual_k is now initialized in __init__() instead of only in fit(), preventing AttributeError if score() is called before fit().
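The configurable-weights pattern from Fix 6 can be sketched like this. The six weight values come from the PR, but the component names and class internals here are invented for illustration; the real evaluation.py API may differ. Defaults live in a module-level dict and a per-instance override is shallow-merged on top.

```python
# Default weights from the PR; component names are assumptions, not the real keys.
_DEFAULT_WEIGHTS = {
    "coverage": 0.30, "stability": 0.20, "accuracy": 0.30,
    "complexity": -0.15, "generalisation": 0.20, "mc_penalty": -0.10,
}

class FeatureValidityEvaluator:
    def __init__(self, weights=None):
        # Shallow-merge: caller overrides win, unspecified keys keep defaults.
        self.weights = {**_DEFAULT_WEIGHTS, **(weights or {})}

    def score(self, components):
        # Weighted sum over whichever component scores are supplied.
        return sum(self.weights.get(k, 0.0) * v for k, v in components.items())

ev = FeatureValidityEvaluator(weights={"mc_penalty": -0.20})
print(ev.weights["mc_penalty"], ev.weights["coverage"])  # → -0.2 0.3
```

The merge keeps backward compatibility: constructing with no argument reproduces the hardcoded behaviour exactly.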

LOW priority (Fixes 10–12): Code Organization & Documentation

  10. runner.py - Unified integration interface: Passing an object (e.g., mlflow_tracker=tracker) now automatically implies use_mlflow=True. Boolean flags are retained for backward compatibility but are now redundant.
  11. features.py, data/delta_h_binary.json (new) - Externalized binary enthalpy data: The 179-entry _DELTA_H_BINARY dict moved to a JSON file, lazy-loaded via _load_delta_h() on the first get_binary_enthalpy() call.
  12. docs/user_manual.md - Phase 0.5 documentation: Added Phase 0.5 (multicollinearity diagnostics) to the workflow table with an explanation of VIF calculation and column drops.
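Fix 11's lazy loading with tuple-key conversion (comma-separated JSON keys like "Co,Cr" back to ("Co", "Cr"), as described in Notes) might look roughly like the following. File path, caching mechanism, and symmetric-pair fallback are assumptions, not the actual features.py implementation.

```python
import functools
import json
from pathlib import Path

@functools.lru_cache(maxsize=1)
def _load_delta_h(path="data/delta_h_binary.json"):
    # Load once on first call; JSON keys are comma-separated element pairs
    # ("Co,Cr"), converted back to the tuple keys used in memory.
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    return {tuple(k.split(",")): v for k, v in raw.items()}

def get_binary_enthalpy(a, b, path="data/delta_h_binary.json"):
    table = _load_delta_h(path)
    # Binary enthalpies are symmetric; try both orderings (assumed behaviour).
    return table.get((a, b), table.get((b, a)))
```

Because the conversion is a plain split on ",", element symbols containing commas would break it; for the 179 chemical-element pairs in the data file this cannot occur.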

Bug fixes from Devin Review (commit 72abb15)

  1. NameError: wf_map → _BUILTIN_FACTORIES (runner.py:263) - Error message referenced the deleted variable wf_map after Fix #1 renamed it. Now correctly references _BUILTIN_FACTORIES.

  2. OOD ensemble averaging bug (runner.py:868-904) - Fix #2 initially averaged composite scores element-wise across folds, but different seeds produce different test sets (different sample indices). Element-wise np.mean(fold_composites, axis=0) was meaningless — position i in each array corresponded to a different sample.

    Fix: Replaced with per-sample accumulation using global arrays:

    score_sum = np.zeros(n_samples, dtype=np.float64)
    score_count = np.zeros(n_samples, dtype=np.int32)
    for fold_i, (tr_idx, te_idx) in enumerate(all_ood_folds):
        # ... run detector ...
        score_sum[te_idx] += res.composite_scores
        score_count[te_idx] += 1
    # Average only where samples scored by multiple folds
    avg_scores = score_sum[primary_te] / np.maximum(score_count[primary_te], 1)
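With toy numbers the accumulation behaves as expected. The snippet below mirrors the fix above but is self-contained; the two folds and the primary_te index set are made-up values, chosen so that the fold test sets overlap on sample 2, whose final score should be the mean of the two fold scores.

```python
import numpy as np

n_samples = 5
score_sum = np.zeros(n_samples, dtype=np.float64)
score_count = np.zeros(n_samples, dtype=np.int32)
# Two toy folds whose test sets overlap on sample index 2.
toy_folds = [(np.array([0, 2]), np.array([0.2, 0.6])),
             (np.array([2, 4]), np.array([0.8, 0.4]))]
for te_idx, composite_scores in toy_folds:
    score_sum[te_idx] += composite_scores
    score_count[te_idx] += 1
primary_te = np.array([0, 2, 4])
avg_scores = score_sum[primary_te] / np.maximum(score_count[primary_te], 1)
print(avg_scores)  # → [0.2 0.7 0.4]
```

Sample 2 correctly averages (0.6 + 0.8) / 2 = 0.7, while samples scored by only one fold pass through unchanged; element-wise averaging of the raw fold arrays would instead have mixed scores from unrelated samples.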

Pipeline architecture fixes (4 items)

User identified fundamental architectural gaps causing poor model performance. These are systematic fixes, not whack-a-mole patches.

Fix 1: Phase 0 dropped columns → Phase 3 training (runner.py)

Phase 0 multicollinearity diagnostics identified constant and perfectly-collinear columns (dropped_constant, dropped_perfect) but Phase 3 training still used the original FeatureCatalog.columns() — the drops were never reflected in the actual training arrays.

Fix: Added _effective_cols: Dict[str, List[str]] to ExperimentRunner. After Phase 0 completes, effective columns are computed by subtracting dropped columns. Phase 3 array preparation and Phase 4 OOD detection both now use _effective_cols instead of raw FeatureCatalog.columns(). Feature sets with missing columns are skipped with a warning instead of crashing.
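The effective-columns bookkeeping can be sketched as a simple set subtraction that preserves column order. Function and variable names here are illustrative, not the actual ExperimentRunner attributes.

```python
def compute_effective_cols(catalog_cols, dropped_constant, dropped_perfect):
    """Subtract Phase 0 drop lists from each feature set, keeping order."""
    dropped = set(dropped_constant) | set(dropped_perfect)
    return {fs: [c for c in cols if c not in dropped]
            for fs, cols in catalog_cols.items()}

# Toy catalog standing in for FeatureCatalog.columns() output.
catalog = {"FS_SIZE": ["r_avg", "delta_r", "natoms"],
           "FS_ALL": ["r_avg", "delta_r", "natoms", "dS_mix"]}
eff = compute_effective_cols(catalog, dropped_constant=["natoms"], dropped_perfect=[])
print(eff["FS_ALL"])  # → ['r_avg', 'delta_r', 'dS_mix']
```

Phase 3 array preparation then indexes the data with eff[fs] rather than the raw catalog list, so the Phase 0 drops actually reach the training arrays.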

Fix 2: Integrate run_feature_selection() into runner.py (runner.py)

run_feature_selection() was implemented in feature_selection.py but never called from the main pipeline. High-dimensional/collinear features were passed directly to learning without any reduction.

Fix: Added "Phase 0.5" after Phase 0 diagnostics. For each feature set with >3 features, Lasso-based feature selection runs and updates _effective_cols. In testing with oqmd_l12_compounds.csv, this reduced FS_ALL from 77→13 features. Failures are caught and logged (falls back to all cleaned features).
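A Lasso-based selection step like the Phase 0.5 pass could be sketched as below. This is a hedged stand-in, not the real run_feature_selection() API: it standardizes features, fits LassoCV, keeps columns with nonzero weight, skips reduction for ≤3 features, and falls back to all columns on failure, matching the behaviour described above.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lasso_select(X, y, cols):
    if len(cols) <= 3:          # too few features to bother reducing
        return cols
    try:
        # Standardize so the L1 penalty treats all features comparably.
        Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
        model = LassoCV(cv=3, random_state=0).fit(Xs, y)
        kept = [c for c, w in zip(cols, model.coef_) if abs(w) > 1e-8]
        return kept or cols     # never return an empty feature set
    except Exception:
        return cols             # fall back to all cleaned features
```

On synthetic data where only two of ten columns carry signal, a pass like this typically retains those two plus at most a few weakly-weighted noise columns.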

Fix 3: Remove StandardScaler+PCA from tree models (workflows.py)

Tree models (RF, XGB) are scale-invariant and don't need StandardScaler. PCA destroys physically meaningful feature axes and makes feature importance meaningless.

Fix: Removed ("scaler", StandardScaler()) and _make_pca_step() from WorkflowXGB.run() and WorkflowRF.run(). Modified WorkflowENS._make_member() to conditionally apply preprocessing: tree-based members (XGB, GBR) get model-only pipelines; linear members (Ridge) keep scaler+PCA. The _make_member() signature now accepts n_features parameter for PCA estimation.
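The conditional preprocessing can be illustrated with a toy factory. This mirrors the described behaviour rather than the exact _make_member() code (member names, the PCA sizing rule, and using GradientBoostingRegressor in place of XGB are assumptions): tree-based members get model-only pipelines, while linear members keep scaler + PCA.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def make_member(name, n_features):
    if name in ("XGB", "GBR"):
        # Tree-based: scale-invariant, and PCA would destroy physically
        # meaningful feature axes, so no preprocessing at all.
        return Pipeline([("model", GradientBoostingRegressor())])
    # Linear member: keep scaler + PCA; component count derived from
    # n_features (toy rule for illustration).
    n_comp = max(1, min(n_features - 1, 8))
    return Pipeline([("scaler", StandardScaler()),
                     ("pca", PCA(n_components=n_comp)),
                     ("model", Ridge())])
```

Passing n_features into the factory is what lets the linear branch size its PCA without refitting logic elsewhere, which is presumably why the signature change was needed.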

Fix 4: Expand hyperparameter search space (workflows.py)

XGB and RF grids were too shallow (3 params × 2-3 values each) for effective exploration.

Fix:

  • XGB: Expanded from 3→6 params. Added subsample: [0.8, 1.0], colsample_bytree: [0.8, 1.0], min_child_weight: [1, 5]. Increased n_estimators: [100, 300] and max_depth: [3, 6].
  • RF: Expanded from 3→5 params. Added min_samples_leaf: [1, 2], max_features: ["sqrt", 1.0]. Increased n_estimators: [200, 500] and adjusted max_depth: [None, 15].

IMPORTANT: XGB param grid now conditionally adds colsample_bytree and min_child_weight only when _XGB_AVAILABLE is True, preventing crashes when using the GradientBoostingRegressor fallback (which doesn't support these params).
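The conditional grid construction could look like the sketch below. The values for n_estimators, max_depth, subsample, colsample_bytree, and min_child_weight come from the PR; the learning_rate entry and the function shape are assumptions, and xgb_available stands in for _XGB_AVAILABLE.

```python
def build_xgb_grid(xgb_available):
    grid = {
        "n_estimators": [100, 300],
        "max_depth": [3, 6],
        "learning_rate": [0.05, 0.1],   # assumed third base param
        "subsample": [0.8, 1.0],        # supported by both XGB and GBR
    }
    if xgb_available:
        # GradientBoostingRegressor fallback doesn't support these, so they
        # are added only when real XGBoost is installed.
        grid["colsample_bytree"] = [0.8, 1.0]
        grid["min_child_weight"] = [1, 5]
    return grid
```

Building the dict conditionally (rather than filtering a full grid later) keeps the search space definition and the fallback compatibility rule in one place.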

Error resilience: wrap all GUI refresh callbacks in try/except

User reported that after a long experiment (~16000 sec), the GUI showed "エラー" (Error) repeatedly but the terminal had no traceback. Root cause: Gradio internally catches exceptions in callbacks and displays a generic "Error" toast without logging to stderr.

Fix: All 6 post-experiment refresh/callback functions are now wrapped in try/except with logger.exception():

  • _refresh_dashboard_data: ("0", "--", "--", "--", ..., None, None)
  • _refresh_results_data: gr.update() × 3 + empty DataFrames + None plots
  • _refresh_ood_data: None plot + error message + empty DataFrame
  • _refresh_report_data: Error message string tuple
  • _refresh_data_summary: Empty summary tuple
  • _export_ood_csv: gr.update(value=None, visible=False)

This ensures:

  1. Any future exceptions are logged to terminal via logger.exception() instead of being silently swallowed
  2. GUI displays graceful fallback values instead of crashing with repeated "エラー" toasts
  3. The experiment results are not lost — users can still see partial data
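The pattern above can be expressed generically as a decorator (the actual PR wraps each function in its own try/except; a decorator gives equivalent behaviour and is shown here only as a sketch): log the full traceback via logger.exception() and return a safe fallback so the GUI renders a graceful value instead of a bare "Error" toast.

```python
import functools
import logging

logger = logging.getLogger(__name__)

def safe_callback(fallback):
    """Wrap a GUI refresh callback: log exceptions, return a safe fallback."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                # Traceback goes to the terminal instead of being swallowed.
                logger.exception("Callback %s failed", fn.__name__)
                return fallback
        return wrapper
    return deco

@safe_callback(fallback=("0", "--", "--", "--"))
def refresh_dashboard():
    raise RuntimeError("boom")

print(refresh_dashboard())  # → ('0', '--', '--', '--')
```

The fallback value must match the callback's declared Gradio outputs in arity and type, which is why each of the six functions has its own tailored fallback tuple.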

Tested: Full 90-run experiment with oqmd_l12_compounds.csv (143 rows × 11 cols, target=delta_e, all workflows). All tabs loaded successfully after completion with zero errors in GUI and zero exceptions in terminal.

String/text column support as explanatory variables

User reported that element columns (e.g. element_A, element_B) written in alphabetic text could not be selected as features. Now:

  • Feature checkboxes show all columns (numeric + string), not just numeric
  • String columns are annotated with [文字列] tag in the UI
  • When string columns are selected, they are automatically one-hot encoded (pd.get_dummies) before being passed to the ML pipeline
  • Numeric columns continue to be used as-is with median NaN imputation
  • Default selection still pre-checks only numeric columns; string columns require manual selection
  • ID-like columns (integer, all unique, monotonically ordered) and constant columns (variance=0) are excluded from defaults using data-driven heuristics (no hardcoded column names)
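The default-selection heuristics and one-hot handling above can be sketched as follows. Function names are illustrative (the real app.py helpers may differ), but the rules match the list: string columns need manual opt-in, constant columns (variance = 0) and ID-like columns (integer, all unique, monotonically ordered) are excluded from defaults, and selected string columns go through pd.get_dummies.

```python
import pandas as pd

def is_id_like(s):
    # Data-driven heuristic: integer dtype, all unique, monotonically ordered.
    return (pd.api.types.is_integer_dtype(s) and s.is_unique
            and (s.is_monotonic_increasing or s.is_monotonic_decreasing))

def default_features(df, target):
    out = []
    for c in df.columns:
        if c == target:
            continue
        s = df[c]
        if not pd.api.types.is_numeric_dtype(s):
            continue                      # string cols need manual opt-in
        if s.nunique(dropna=True) <= 1:   # constant column (variance = 0)
            continue
        if is_id_like(s):
            continue
        out.append(c)
    return out

df = pd.DataFrame({"entry_id": [1, 2, 3, 4],
                   "natoms": [4, 4, 4, 4],
                   "volume": [10.2, 11.5, 9.8, 10.9],
                   "element_A": ["Al", "Ni", "Co", "Cr"],
                   "delta_e": [0.1, -0.2, 0.05, -0.1]})
print(default_features(df, target="delta_e"))  # → ['volume']
# If the user manually selects element_A, it is one-hot encoded:
X = pd.get_dummies(df[["volume", "element_A"]])
```

No column names are hardcoded: entry_id and natoms are excluded purely by their statistical properties, so the same logic works on any CSV.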

KeyError fix: _refresh_ood_data / _export_ood_csv in generic CSV mode

After a successful 540-run experiment, the final tab refresh crashed with:

KeyError: "None of [Index(['r_avg', 'delta_r', 'dS_mix', ...], dtype='str')] are in the [columns]"

Root cause: _refresh_ood_data() and _export_ood_csv() called FeatureCatalog.columns() which returns HEA-specific column names, but in generic CSV mode features_df has the actual CSV columns (including one-hot encoded columns). The FeatureCatalog._SETS monkey-patch in _run_in_thread had already been restored by the finally block before these functions ran.

Fix: Both functions now check session["csv_mode"] — in "generic" mode they use features_df.columns directly. Added available_cols safety filter to prevent KeyError even if some columns are missing.
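The available_cols safety filter amounts to intersecting the requested column list with what the DataFrame actually has (helper name is illustrative): a stale column list then degrades to a subset instead of raising KeyError.

```python
import pandas as pd

def select_available(features_df, wanted_cols):
    # Only request columns that actually exist; preserves requested order.
    available = [c for c in wanted_cols if c in features_df.columns]
    return features_df[available]

df = pd.DataFrame({"a": [1], "b": [2]})
sub = select_available(df, ["a", "r_avg", "delta_r"])
print(list(sub.columns))  # → ['a']
```

Direct indexing with the same list (df[["a", "r_avg", "delta_r"]]) would raise exactly the KeyError quoted above, since pandas rejects any label missing from the columns.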

fix_instructions.docx - 10 fixes applied (10件の修正適用)

Critical (重大)

  1. 2.4 KeyError fix for CSV generic mode (app.py, runner.py):

    • _run_in_thread() now detects csv_mode == "generic" and temporarily overrides FeatureCatalog._SETS so all feature sets use the CSV columns instead of HEA-specific columns. Original _SETS is restored in a finally block.
    • runner.py _phase3_train() now skips feature sets whose columns are missing from the data (logs a warning instead of crashing with KeyError).
  2. 2.1 Validity score weight documentation (user_manual.md §8.1):

    • Updated weights from (0.25, 0.20, 0.20, 0.15, 0.20) to (+0.30, +0.20, +0.30, −0.15, +0.20, −0.10) to match evaluation.py implementation.
  3. 2.2 5 components → 6 components (5要素→6要素) (user_manual.md, plotly_charts.py, report.py):

    • Added an MC Penalty (Multicollinearity Penalty) column to the validity score DataFrame and the markdown report table.
    • Updated all "5要素" (5-component) references to "6要素" (6-component) across the manual.
  4. 2.3 Feature set counts (user_manual.md §7.1):

    • FS_SIZE: 10→12 (added B_avg, Vm_avg to listed features).
    • FS_ALL: 16→18.

Medium (中程度)

  1. 3.1 Cr duplication comment (dataset.py):

    • Added clarifying comment explaining that Cr appears intentionally in both _POOL_FCC and _POOL_BCC (used in both Cantor-type FCC and refractory BCC alloys).
  2. 3.2 noise_std deprecation warning (dataset.py):

    • Removed the conditional check if noise_std != 50.0: — now always emits the deprecation warning regardless of the parameter value.
  3. 3.3 Windows resource module compatibility (runner.py):

    • Conditional import: try: import resource; _HAS_RESOURCE = True; except ImportError: _HAS_RESOURCE = False.
    • All resource.getrusage() calls now guarded by if _HAS_RESOURCE: to prevent crashes on Windows.
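Item 3.3's guard is the standard pattern for Unix-only stdlib modules; a minimal sketch (the peak_rss_kb helper is hypothetical, not a runner.py function):

```python
# The resource module exists only on Unix; import it behind a flag so the
# module still loads on Windows.
try:
    import resource
    _HAS_RESOURCE = True
except ImportError:
    _HAS_RESOURCE = False

def peak_rss_kb():
    """Peak resident set size, or None on platforms without resource."""
    if not _HAS_RESOURCE:
        return None
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
```

Callers must then tolerate a None result, which is why every getrusage() call site in runner.py is guarded by the same flag.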

Minor (軽微)

  1. 4.1 OOD fold-0 limitation note (user_manual.md §8.2):

    • Added note that OOD detection currently uses only fold-0 (design tradeoff for speed vs. computation cost).
  2. 4.2 MC Penalty column description (user_manual.md §5.4):

    • Updated report content table to list MC Penalty column.
  3. 4.3 WF-LASSO/ARD/RF workflows (user_manual.md §7.2):

    • Added WF-LASSO, WF-ARD, WF-RF to the workflow table (previously only WF-LIN/XGB/ENS were documented despite being implemented in earlier PRs).

Security & correctness fixes (Devin Review — prior revision)

  1. XSS fix — missing-column warning: Column names from CSV are now escaped with html_mod.escape() in the missing-value warning section of _analyze_csv_format().
  2. XSS fix — constant-column warning: Column names in the constant-column (variance=0) warning are now escaped.
  3. Data leakage prevention: _handle_generic_csv() now defensively removes the target column from feature_cols (feature_cols = [c for c in feature_cols if c != target_col_clean]) with an error message if the resulting list is empty.
  4. Traceback HTML escaping: _on_csv_upload() error handler wraps traceback.format_exc() output in <pre> with html_mod.escape() to prevent broken HTML from angle brackets like <module>.

Earlier fixes (still in this PR)

  1. Default target uses last numeric column (numeric_cols[-1]) instead of first — matches convention that target is typically the last column.
  2. _on_target_change caches numeric columns in gr.State — no re-reading CSV on target change, no dtype inconsistency from nrows=5.
  3. Numeric dtype filter on feature checkboxes via pd.api.types.is_numeric_dtype().
  4. Docstring return type for _analyze_csv_format corrected to Tuple[pd.DataFrame, str, List[str], List[str]].

Other

  1. GUI version tag updated from PR#126 to PR#137.

GUI Testing

Test: oqmd_l12_compounds.csv pipeline architecture verification (143 rows × 11 columns):

  • CSV書式確認 correctly shows 8 numeric / 3 string columns
  • Constant column warning displayed for natoms (all values = 4)
  • Target changed to delta_e; selected lattice_constant, stability, volume, element_A [文字列], element_B [文字列] as features
  • "CSV読み込み" processed in generic mode (143 samples)
  • Full experiment: 90 runs (WF-LIN, WF-LASSO, WF-ARD, WF-RF, WF-XGB, WF-ENS × 3 splits × 5 folds) completed in 52.4 sec
  • Terminal logs confirmed all 4 pipeline fixes:
    • Fix 1: Effective columns [FS_ALL]: 77 features (after Phase 0 drops) — dropped 2 perfect-collinear element columns
    • Fix 2: Feature selection [FS_ALL]: 77 → 13 features — Lasso reduced feature count
    • Fix 3: No scaler or pca steps logged for WF-RF/WF-XGB (tree models)
    • Fix 4: XGB best params include expanded grid params (subsample, etc.)
  • Zero "エラー" messages in GUI
  • Zero exceptions in terminal logs (only INFO/WARNING)
  • All tabs loaded successfully: Dashboard, Results, OOD Map, Report
  • Results tab shows parity plots for all 6 workflows
  • OOD detection completed without KeyError: 1/29 samples flagged (3.4%)

(Screenshots: CSV Format Analysis; Data Summary - Generic Mode)

Review & Testing Checklist for Human

Risk Level: 🔴 RED - Significant architectural changes + OOD ensemble algorithm change may impact reproducibility and model performance.

  • OOD per-sample accumulation correctness - The OOD ensemble fix (commit 72abb15) replaced element-wise averaging with per-sample accumulation. Verify the logic is correct: scores should be mapped back to global sample indices (score_sum[te_idx] += res.composite_scores), then averaged only where samples were scored by multiple folds. Test with a multi-seed experiment and verify OOD flags are consistent across runs with the same data.
  • Generalisation score formula change - Fix #7 changed from min(1.0, 0.5 + 0.5 * geo_mean) to min(1.0, geo_mean). This will change evaluation results for all feature sets. Compare validity scores before/after on a reference dataset to understand the impact. Verify the new formula makes sense (no artificial baseline).
  • ValidityScore configurable weights - Fix #6 added a _weights dict field to the ValidityScore dataclass. Verify this doesn't leak into DataFrame exports or JSON serialization. Check that the to_dataframe() and to_dict() methods don't include _weights.
  • Binary enthalpy JSON loading - Fix #11 externalized _DELTA_H_BINARY to JSON. Verify the lazy loading works (_load_delta_h() called on first get_binary_enthalpy()). Check that tuple-key conversion from comma-separated strings works correctly. Test with an HEA dataset to ensure dH_mix features are calculated correctly.
  • Dynamic DataFrame columns - Fix #8 uses dataclasses.fields() to dynamically generate column names. Verify to_dataframe() includes exactly the expected scalar fields (workflow, feature_set, split_policy, seed, fold, rmse_*, mae_*, r2_*, elapsed_sec) and excludes array/dict fields (y_test_true, y_test_pred, test_indices, params, artifacts, _weights).
  • Pipeline architecture fixes - Verify all 4 pipeline fixes work correctly: (1) Phase 0 drops reflected in Phase 3, (2) Lasso feature selection runs in Phase 0.5, (3) tree models don't use StandardScaler/PCA, (4) expanded HPO grids don't crash. Run a full experiment with HEA_ml_numeric_highconf.csv and check terminal logs.
  • AIC/BIC skip threshold - Fix #4 skips AIC/BIC when n_features > 30. This is an arbitrary threshold. Test with datasets at the boundary (e.g., 29, 30, 31 features) to verify the skip logic works and doesn't cause issues.

Test Plan

  1. Start GUI: python -m hea_extrapolation_platform gui
  2. Test 12 code review improvements:
    • Upload a CSV with high dimensionality (e.g., oqmd_l12_compounds.csv after one-hot encoding should have 70+ features)
    • Run full experiment (not Quick Mode) to test all 12 fixes
    • Check terminal logs for:
      • Lazy workflow instantiation (no upfront "Loading WF-..." messages)
      • OOD ensemble: "OOD FS_ALL: X/Y flagged (ensemble over N folds)" with N > 1
      • Feature selection: "Feature selection [FS_ALL]: M → K features" with K < M
      • AIC/BIC skip: "AIC skipped: N features > threshold 30" if applicable
    • Verify Results tab shows all 6 workflows (LIN, LASSO, ARD, XGB, ENS, RF)
    • Verify validity scores include MC Penalty column
  3. Test OOD ensemble bug fix:
    • Run an experiment with multiple seeds (--seeds 42,123,456)
    • Check that OOD composite scores are averaged correctly (not element-wise across different sample sets)
    • Verify OOD Map tab loads without errors
  4. Test backward compatibility:
    • Upload HEA_ml_numeric_highconf.csv (HEA mode)
    • Verify HEA mode auto-detected and experiment runs successfully
    • Check dH_mix feature values are correct (verify binary enthalpy JSON loading)
  5. Test binary enthalpy JSON loading:
    • Check terminal logs for "Loaded 179 binary enthalpy entries from data/delta_h_binary.json" message
    • If warning appears, investigate why JSON loading failed
  6. Test ValidityScore weights:
    • Export run registry JSON and verify _weights field is NOT present
    • Check validity score DataFrame doesn't include _weights column
  7. Test generalisation score change:
    • Compare validity scores with a baseline dataset from a previous version
    • Verify scores changed as expected (should generally be higher with new formula)

Notes

  • Link to Devin Session: https://app.devin.ai/sessions/bca8c5188d2c4cd58e2182d85a0e19ad
  • Requested by: @minimumtone
  • Reproducibility warning: The OOD ensemble algorithm change (Fix #2 + commit 72abb15) will produce different OOD detection results compared to previous versions, even with the same random seed. This is expected and correct (the old algorithm was buggy).
  • Evaluation score changes: Fix #7 (generalisation score formula) will change validity score rankings. Re-run reference experiments to establish new baselines.
  • Binary enthalpy externalization: The JSON file uses comma-separated string keys (e.g., "Co,Cr") instead of tuple keys. The conversion happens at load time via tuple(k.split(",")). Verify this works correctly for all 179 entries.
  • Dynamic DataFrame columns risk: The to_dataframe() method filters fields by checking isinstance(getattr(self._runs[0], name, None), (np.ndarray, dict)). This assumes the first run is representative of all runs. If runs have heterogeneous field types (unlikely but possible), some columns may be incorrectly included/excluded.
  • AIC/BIC threshold arbitrariness: The n_features > 30 threshold is a heuristic. Datasets with 31-50 features may still benefit from AIC/BIC selection but will now skip it. Consider making this threshold configurable in a future PR.
  • XGB fallback conditional: The expanded HPO grid conditionally adds XGBoost-specific params only when _XGB_AVAILABLE is True. This prevents crashes on GradientBoostingRegressor fallback, but the fallback itself may have other parameter incompatibilities (e.g., subsample may not work identically). Test on a system without XGBoost if possible.
  • 12 improvements risk summary: Fixes 1-4 are high-impact changes to core pipeline logic. Fixes 5-9 are refactoring/design improvements with lower risk. Fixes 10-12 are documentation/organization with minimal risk. Prioritize testing high-impact fixes.


- Add _analyze_csv_format(): detects encoding, dtypes, missing values,
  displays per-column analysis table with warnings
- Add _handle_generic_csv(): supports non-HEA CSVs by using user-selected
  numeric columns directly as features (bypasses HEA feature computation)
- Add CSV書式確認 accordion: auto-opens on upload showing format report
- Add 説明因子・目的変数の選択 accordion: Dropdown for target variable,
  CheckboxGroup for explanatory variables, Radio for CSV mode (auto/HEA/Generic)
- Update _handle_csv_upload() to accept selected_features and force_generic params
- Update run_experiment() to pass column role selections to pipeline
- Auto-detect: numeric columns suggested as features, first numeric as target
- Target change auto-excludes from feature list

Tested with both HEA CSV (element columns) and B2_result CSV (generic numeric).

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
@devin-ai-integration
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.


… hint

- _on_target_change now uses pd.api.types.is_numeric_dtype() to filter
  columns, preventing string/datetime columns from appearing in feature list
- _analyze_csv_format docstring return type corrected to match actual signature

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
devin-ai-integration[bot]

This comment was marked as resolved.

…n gr.State

- Fix default_target to use numeric_cols[-1] (last column) instead of [0]
- Store numeric_cols in gr.State during upload to avoid re-reading CSV
- _on_target_change now uses cached list for consistent dtype detection

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>

When file_obj is None, return empty [] for csv_numeric_cols_state
to match the 7 declared outputs in the .change() binding.

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>

devin-ai-integration bot and others added 2 commits March 5, 2026 08:29
…banner

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>

…lumns

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>

devin-ai-integration bot and others added 2 commits March 9, 2026 14:30
…ation error

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…Windows compat

- 2.4 [Critical] Fix KeyError in CSV generic mode: override FeatureCatalog._SETS
  with CSV columns in app.py _run_in_thread(); skip missing FS in runner.py
- 2.1 [Critical] Update user_manual.md section 8.1: weights now match
  implementation (0.30/0.20/0.30/-0.15/0.20/-0.10)
- 2.2 [Critical] Update 5要素→6要素 across user_manual.md, add MC Penalty
  column to plotly_charts.py and report.py
- 2.3 [Critical] Fix FS_SIZE count (10→12) and FS_ALL (16→18) in docs
- 3.1 [Medium] Add clarifying comment for intentional Cr duplication in
  dataset.py element pools
- 3.2 [Medium] Remove conditional check on noise_std deprecation warning —
  always warn regardless of value
- 3.3 [Medium] Conditional import of resource module for Windows compatibility
  in runner.py with _HAS_RESOURCE guard
- 4.1 [Minor] Add OOD fold-0 limitation note to user_manual.md section 8.2
- 4.2 [Minor] Add MC Penalty to section 5.4 report content table
- 4.3 [Minor] Add WF-LASSO/ARD/RF to user_manual.md section 7.2 workflow table

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
@devin-ai-integration devin-ai-integration bot changed the title feat: CSV書式確認 & 説明因子・目的変数の選択機能 feat: CSV書式確認 & 説明因子・目的変数の選択機能 + fix_instructions 10件修正 Mar 9, 2026
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 2 new potential issues.

View 14 additional findings in Devin Review.


Comment on lines +238 to +244
warnings.warn(
    "noise_std is deprecated and ignored; strength noise is now "
    "proportional (7 % CV). This parameter will be removed in a "
    "future release.",
    DeprecationWarning,
    stacklevel=2,
)

🟡 Deprecation warning fires unconditionally on every call, not just when parameter is explicitly set

The old code only emitted the DeprecationWarning for noise_std when the caller explicitly passed a non-default value (if noise_std != 50.0). The new code removes this guard and warns on every call to generate_hea_dataset(), even when the caller uses the default. Both call-sites in the codebase (hea_extrapolation_platform/__main__.py:99 and hea_extrapolation_platform/gui/app.py:2837) never pass noise_std, so they will now emit a spurious deprecation warning on every dataset generation, cluttering logs and misleading users into thinking they are using a deprecated API incorrectly.

Suggested change:

if noise_std != 50.0:
    warnings.warn(
        "noise_std is deprecated and ignored; strength noise is now "
        "proportional (7 % CV). This parameter will be removed in a "
        "future release.",
        DeprecationWarning,
        stacklevel=2,
    )

Comment on lines +2949 to +2962
_orig_sets = dict(_FC._SETS)
try:
    csv_cols = list(features_df.columns)
    _FC._SETS = {k: csv_cols for k in _FC._SETS}
    r, s, o = runner.run(
        features_df, features_df, target,
        progress_callback=_progress_cb,
        selected_workflows=selected_wfs,
    )
finally:
    _FC._SETS = _orig_sets
Contributor

@devin-ai-integration devin-ai-integration bot Mar 9, 2026


🔴 Thread-unsafe mutation of shared FeatureCatalog._SETS in generic CSV mode

FeatureCatalog._SETS is a class-level attribute shared across all threads and users. In generic CSV mode, _run_in_thread (gui/app.py:3033-3046) mutates this shared dict by replacing all entries with the current CSV's columns and restores it in a finally block. If two users run generic CSV experiments concurrently (this Gradio app explicitly supports multi-user sessions via gr.State and app.queue()), one user's override can be read by another user's runner, or one user's restore can clobber another user's override, leading to incorrect feature sets and potential KeyError crashes.

Race condition sequence

Thread A overrides _FC._SETS with columns [a, b, c]
Thread B overrides _FC._SETS with columns [x, y, z]
Thread A's runner.run() reads _SETS → gets [x, y, z] instead of [a, b, c]
Thread A's finally restores _orig_sets (overwriting Thread B's override)
Thread B's runner.run() reads _SETS → gets original HEA sets instead of [x, y, z]


Contributor


Acknowledged — this is a valid concern for multi-user deployments. For the current single-user desktop use case this is not a blocker, but noted for future refactoring if concurrent access becomes a requirement.

Contributor


Acknowledged. This is a valid concern for concurrent multi-user scenarios. However, the current deployment is single-user (local Gradio app), and _SETS mutation is scoped within try/finally with immediate restore. If multi-user support becomes a requirement, a thread-local or per-request copy pattern would be appropriate. Deferring this to a future PR unless the maintainer wants it addressed now.

@devin-ai-integration
Contributor

Devin Review raised 3 items. Here's the status:

  1. Empty feature list [] → None: Already fixed in commit c9d6cd8 (selected_features if selected_features is not None else []). The comment is marked "Outdated".

  2. noise_std unconditional warning: This change was explicitly requested by the fix_instructions document (item 3.2: "Remove condition, always warn about deprecated parameter"). The bot suggests reverting to the conditional check, but this contradicts the user's instructions.

  3. Race condition on FeatureCatalog._SETS: Valid theoretical concern, but the Gradio GUI is single-user (queue serializes requests). The fix_instructions recommended approach B (backup/restore), which is what was implemented. The finally block ensures restoration even on exceptions. For a multi-user deployment, this would need a threading.Lock or the override_feature_columns parameter approach suggested by the bot.

No code changes made — all items are either already addressed or are bot suggestions that don't apply to the current single-user use case.

devin-ai-integration bot and others added 2 commits March 9, 2026 14:54
…eature selection

- _analyze_csv_format now returns constant_cols (variance=0) as 5th element
- _on_csv_upload auto-deselects constant columns (count_A, count_B etc.)
  and ID-like columns (entry_id, index, etc.) from default features
- Users can still manually select these columns if needed
- Feature choices still show all numeric columns except target

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…ection

- Remove hardcoded _id_like set (entry_id, index, etc.)
- Detect ID-like columns by statistical properties:
  integer type + all unique values + monotonically ordered
- Also detect constant columns (variance=0) and all-NaN columns
- Fully generic: works with any CSV regardless of column naming

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
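The statistical detection described in this commit could be sketched as follows. This is an illustration only; `detect_excluded_columns` is a hypothetical name, not the function in the codebase:

```python
import pandas as pd


def detect_excluded_columns(df: pd.DataFrame):
    """Flag columns that make poor default features, by statistics alone."""
    constant, id_like, all_nan = [], [], []
    for col in df.columns:
        s = df[col]
        if s.isna().all():
            all_nan.append(col)              # nothing to learn from
        elif s.nunique(dropna=True) <= 1:
            constant.append(col)             # variance == 0
        elif (pd.api.types.is_integer_dtype(s)
              and s.is_unique
              and (s.is_monotonic_increasing or s.is_monotonic_decreasing)):
            id_like.append(col)              # looks like a row identifier
    return constant, id_like, all_nan
```

Because the checks are purely statistical, a column named anything at all is caught as long as it behaves like an ID, a constant, or an empty column.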

devin-ai-integration bot and others added 2 commits March 9, 2026 14:59
…se3_train

When a feature set is skipped due to missing columns, jobs referencing
that feature set remained in the list, causing KeyError at fs_arrays[job.fs_name].
Now filter jobs to only those with successfully prepared feature sets.

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…ding

- Feature checkboxes now show ALL columns (numeric + string)
- String columns annotated with [文字列] tag for easy identification
- String columns auto-encoded via pd.get_dummies (One-Hot Encoding)
- Numeric columns used as-is with median NaN imputation
- Updated UI labels and help text to reflect generic column selection
- Target variable dropdown remains numeric-only
- Fully generic: works with any CSV regardless of column types

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
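The encoding path described in this commit might look roughly like this sketch; `build_feature_matrix` is an illustrative name, not the actual function in app.py:

```python
import pandas as pd


def build_feature_matrix(df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """One-hot encode string columns, median-impute numeric NaNs."""
    X = df[feature_cols]
    numeric = X.select_dtypes(include="number")
    strings = X.select_dtypes(exclude="number")
    numeric = numeric.fillna(numeric.median())        # median NaN imputation
    if strings.empty:
        return numeric
    dummies = pd.get_dummies(strings, dtype=float)    # one-hot encoding
    return pd.concat([numeric, dummies], axis=1)
```

A string column `phase` with values `fcc`/`bcc` would become two 0/1 columns `phase_fcc` and `phase_bcc`, which any downstream regressor can consume.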

devin-ai-integration bot and others added 3 commits March 10, 2026 21:36
In generic CSV mode, FeatureCatalog.columns() returns HEA-specific
column names (r_avg, delta_r, etc.) that don't exist in the CSV
features_df. Now checks session['csv_mode'] and uses features_df.columns
directly when in generic mode, falling back to FeatureCatalog only for
HEA mode. Also adds safety filter for available columns.

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…radio errors

All module-level refresh functions (_refresh_dashboard_data,
_refresh_results_data, _refresh_ood_data, _refresh_data_summary,
_refresh_report_data, _export_ood_csv) now catch exceptions internally
and return safe fallback values instead of propagating to Gradio.

This prevents the 'Error' toast that appears in the GUI when any
post-experiment callback fails — errors are now logged to the terminal
via logger.exception() while the UI gracefully degrades.

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…neric CSV runs only FS_ALL

- workflows.py: WorkflowENS._make_member now accepts n_features param (was hardcoded 132)
- runner.py: ExperimentRunner.run() accepts selected_feature_sets to filter feature sets
- app.py: generic CSV mode passes selected_feature_sets=['FS_ALL'] — no redundant FS comparison

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
devin-ai-integration bot and others added 2 commits March 11, 2026 13:51
Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…ing and user_manual.md TOC

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>

devin-ai-integration bot and others added 2 commits March 11, 2026 14:03
…y docstring enumeration

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
… pipeline, remove StandardScaler+PCA from tree models, expand HPO grids

- Phase 0 dropped columns (constant/collinear) now reflected in Phase 3 & 4
- Lasso feature selection integrated as Phase 0.5 (removes uninformative features)
- WF-XGB/WF-RF: removed StandardScaler+PCA (trees are scale-invariant, PCA destroys physical axes)
- WF-ENS: tree-based members no longer use scaler/PCA
- WF-XGB HPO grid expanded: +subsample, colsample_bytree, min_child_weight
- WF-RF HPO grid expanded: +min_samples_leaf, max_features

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
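The scaler/PCA split by model family could be wired roughly like this sketch. The names (`TREE_WORKFLOWS`, `build_pipeline`, the PCA variance threshold) are assumptions for illustration, not the actual workflow code:

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

TREE_WORKFLOWS = {"WF-XGB", "WF-RF"}


def build_pipeline(name, estimator):
    # Trees split on thresholds, so monotone rescaling changes nothing,
    # and PCA would rotate away physically meaningful feature axes.
    if name in TREE_WORKFLOWS:
        return make_pipeline(estimator)
    # Linear/kernel members still benefit from standardisation (+ PCA).
    return make_pipeline(StandardScaler(), PCA(n_components=0.95), estimator)
```

Keeping the raw feature axes for tree models also makes their feature importances directly interpretable in physical terms.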

…ostingRegressor fallback

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
@devin-ai-integration devin-ai-integration bot changed the title feat: CSV書式確認 & 説明因子・目的変数の選択機能 + fix_instructions 10件修正 feat: CSV書式確認 & 説明因子・目的変数の選択機能 + pipeline architecture fixes Mar 13, 2026
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.

View 20 additional findings in Devin Review.


Comment on lines +478 to +492
_X_fs = features_all[_cols]
_fs_summary = run_feature_selection(
    _X_fs, target,
    methods=["Lasso"],
    feature_set=_fs_key,
)
_lasso = _fs_summary.results.get("Lasso")
if _lasso and _lasso.selected_features:
    _sel = _lasso.selected_features
    if len(_sel) >= 2:
        logger.info(
            "Feature selection [%s]: %d → %d features",
            _fs_key, len(_cols), len(_sel),
        )
        self._effective_cols[_fs_key] = _sel
Contributor

🔴 Feature selection (Phase 0.5) uses full dataset including test data, causing data leakage

Phase 0.5 (runner.py:478-492) runs run_feature_selection on features_all and target, which is the entire dataset — not just training data. The Lasso model in _run_lasso (feature_selection.py:148-149) fits on all samples including future test samples from every cross-validation fold. The features selected this way are then stored in self._effective_cols and used for all Phase 3 training runs and Phase 4 OOD detection. This means feature selection has seen the test targets, invalidating the cross-validation evaluation. Test metrics (RMSE, R², etc.) and validity scores will be optimistically biased because the feature set was chosen with knowledge of test data. This undermines the platform's core purpose of comparing feature set validity and extrapolation capability.

Prompt for agents
In hea_extrapolation_platform/runner.py, the Phase 0.5 feature selection at lines 469-497 calls run_feature_selection(features_all[_cols], target, ...) using the entire dataset. This introduces data leakage because the Lasso model sees test targets. To fix this, either: (1) Remove Phase 0.5 entirely and rely on the learning algorithms' built-in regularization (Lasso, ARD workflows already do feature selection), or (2) Perform feature selection only on a held-out portion of the data that is never used for evaluation — e.g., split off a small portion before Phase 1 folds are computed, or use only the training fold from the first RandomCV split. Option (1) is the safest and simplest fix. If option (2) is chosen, ensure the feature selection split is computed before Phase 1 fold_plan and excluded from all subsequent folds.

Contributor

Valid point about data leakage. Phase 0.5 currently runs run_feature_selection on features_all (entire dataset) before CV splitting, so the Lasso sees test targets.

Options to fix:

  1. Remove Phase 0.5 entirely — rely on built-in regularization (WF-LASSO, WF-ARD already do L1/ARD selection). Simplest and safest.
  2. Move feature selection inside Phase 3 — run Lasso on each training fold only. Most correct but increases computation (N_folds × N_seeds runs of Lasso).
  3. Use a held-out pre-split — reserve a small portion before fold computation for feature selection only.

Leaving this for the reviewer to decide which approach is preferred, as this involves a design tradeoff between computational cost and statistical rigor. The current implementation was requested by the user as "特徴量選択を学習本体に統合すること" — integrating feature selection into the pipeline.
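Option 2 could be sketched in plain numpy as below, using a simple |correlation| ranking as a stand-in for the actual Lasso step; all names here are illustrative:

```python
import numpy as np


def select_features_per_fold(X, y, folds, top_k=2):
    """Rank features using ONLY each fold's training rows, so the
    selection never sees test targets (no leakage)."""
    per_fold = []
    for tr_idx, _te_idx in folds:
        Xtr, ytr = X[tr_idx], y[tr_idx]
        # importance proxy: absolute correlation with the target,
        # computed exclusively on training rows
        corr = np.array([abs(np.corrcoef(Xtr[:, j], ytr)[0, 1])
                         for j in range(X.shape[1])])
        per_fold.append(list(np.argsort(-corr)[:top_k]))
    return per_fold
```

The cost is one selection run per fold, which is the computational tradeoff mentioned above.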

HIGH priority (fixes 1-4):
- #1: Lazy workflow instantiation via lambda factories in _run_job
- #2: Phase 4 OOD multi-fold ensemble across all seeds
- #3: O(1) hash dict lookup in _collect_ood_errors (replaces O(n) np.array_equal)
- #4: AIC/BIC forward stepwise guard for n_features > 30

MEDIUM priority (fixes 5-9):
- #5: Consolidate safe_array/_safe_np into _utils.py (single source of truth)
- #6: Configurable weights for ValidityScore.total via FeatureValidityEvaluator(weights=...)
- #7: Fix generalisation score formula: min(1.0, geo_mean) instead of min(1.0, 0.5+0.5*geo_mean)
- #8: Dynamic column names in RunRegistry.to_dataframe via dataclasses.fields()
- #9: Initialize OODDetector._actual_k in __init__ (not just fit())

LOW priority (fixes 10-12):
- #10: Unified mlflow/feast/mint interface (passing object implies use)
- #11: Externalize _DELTA_H_BINARY to data/delta_h_binary.json
- #12: Add Phase 0.5 documentation to user_manual.md

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 2 new potential issues.

View 23 additional findings in Devin Review.


Contributor

🔴 NameError: wf_map referenced but variable was renamed to _BUILTIN_FACTORIES

In the _run_job function, the error message on line 263 references wf_map, but the variable was renamed to _BUILTIN_FACTORIES on line 241. When an unknown workflow name is encountered (not built-in and not in mint_configs), the raise KeyError will itself raise a NameError: name 'wf_map' is not defined, masking the original informative error message with a confusing traceback.

(Refers to line 263)


Contributor

Fixed in 72abb15: wf_map → _BUILTIN_FACTORIES in the error message.

Comment on lines +882 to +896
if len(fold_composites) > 1:
    avg_composite = np.mean(fold_composites, axis=0)
    avg_threshold = float(np.quantile(avg_composite, 0.95))
    is_ood_avg = avg_composite > avg_threshold
    n_ood = int(is_ood_avg.sum())
    ood_res = OODResult(
        mahalanobis_scores=primary_res.mahalanobis_scores,
        knn_scores=primary_res.knn_scores,
        composite_scores=avg_composite,
        is_ood=is_ood_avg,
        ood_threshold=avg_threshold,
        ood_ratio=n_ood / max(len(avg_composite), 1),
        n_total=len(avg_composite),
        n_ood=n_ood,
    )
Contributor

🔴 OOD ensemble averages composite scores across different sample sets

In _phase4_ood, each seed's fold-0 produces a different test set (different sample indices due to different random shuffles). The code collects composite scores from each fold into fold_composites and averages them element-wise with np.mean(fold_composites, axis=0) at runner.py:883. However, position i in each array corresponds to a different sample in each fold, so the element-wise average is meaningless — it averages OOD scores of unrelated samples. The resulting avg_composite is then used to determine is_ood and stored with the primary fold's test indices (ood_test_idx from all_ood_folds[0]), producing incorrect OOD classifications. The _collect_ood_errors call at line 912-913 also uses ood_test_idx from fold 0, so the is_ood mask (computed from averaged scores of mixed samples) is applied to fold-0's test samples, yielding wrong OOD error evaluations.

Prompt for agents
In hea_extrapolation_platform/runner.py, the _phase4_ood method (lines 806-923) attempts to ensemble OOD scores across multiple RandomCV folds from different seeds, but each seed produces a different test set (different sample indices). Element-wise averaging of composite_scores arrays is therefore meaningless since position i in each array refers to a different sample.

Fix options:
1. Revert to single-fold OOD (use only fold 0 of seed 0, as was done before this PR).
2. If ensemble OOD is desired, map scores back to a per-sample array indexed by global sample index (e.g. create an array of shape (n_total_samples,) and accumulate scores per sample across folds, then average only where a sample was scored by multiple folds).
3. Alternatively, only use folds that share the same test indices (which won't happen with different seeds).

Contributor

Good catch. Fixed in 72abb15: replaced element-wise averaging with per-sample accumulation using global score_sum/score_count arrays indexed by original sample indices. Each fold's scores are mapped back to global indices via score_sum[te_idx] += res.composite_scores, then averaged only where samples overlap (score_sum[primary_te] / score_count[primary_te]). This ensures the averaging is meaningful even when different seeds produce different test sets.
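The per-sample accumulation described here can be sketched as follows (illustrative function name; the real code lives in _phase4_ood):

```python
import numpy as np


def ensemble_ood_scores(n_samples, folds):
    """Average composite OOD scores per sample across folds whose test
    sets differ; a sample is averaged only over folds that scored it."""
    score_sum = np.zeros(n_samples)
    score_count = np.zeros(n_samples)
    for te_idx, scores in folds:
        score_sum[te_idx] += scores      # map scores back to global indices
        score_count[te_idx] += 1
    avg = np.full(n_samples, np.nan)     # NaN where no fold scored a sample
    scored = score_count > 0
    avg[scored] = score_sum[scored] / score_count[scored]
    return avg
```

Position i now always refers to the same sample regardless of which folds happened to include it, which is what makes the average meaningful.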

…cumulation

- Fix NameError: error message referenced deleted 'wf_map' variable,
  now correctly references '_BUILTIN_FACTORIES'
- Fix OOD ensemble: different seeds produce different test sets, so
  element-wise averaging was meaningless. Now uses per-sample score
  accumulation via global arrays (score_sum/score_count) indexed by
  original sample indices, averaging only where samples overlap.

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.

View 21 additional findings in Devin Review.


# (Review: min causes asymmetric improvements to be undervalued)
  geo_mean = math.sqrt(rand_improve * block_improve)
- return min(1.0, 0.5 + 0.5 * geo_mean)
+ return min(1.0, geo_mean)
Contributor

🔴 Generalisation score formula produces unreasonably low values after removing base offset

The old formula min(1.0, 0.5 + 0.5 * geo_mean) mapped realistic improvement ratios to the [0.5, 1.0] range. The new formula min(1.0, geo_mean) maps them to [0, 1.0] without any offset. Since geo_mean = sqrt(rand_improve * block_improve) where improvements are relative RMSE reductions (typically 0.01–0.20), the geometric mean is very small. For example, if both splits improve RMSE by 5%, geo_mean = sqrt(0.05 * 0.05) = 0.05 — giving a generalisation score of 0.05 instead of the old 0.525. Since generalisation has the highest weight (0.30), this deflates total validity scores for all non-baseline feature sets, making the ranking less discriminative. The docstring states "both splits improve -> 1" but achieving near-1.0 now requires ~100% RMSE reduction, which is unrealistic.

Suggested change:
- return min(1.0, geo_mean)
+ return min(1.0, 0.5 + 0.5 * geo_mean)

Contributor

This change was explicitly requested in the user's code review (Fix #7): min(1.0, geo_mean) instead of min(1.0, 0.5 + 0.5 * geo_mean). The bot's analysis is technically correct that typical RMSE improvements (1-20%) will produce low generalisation scores, but the user's intent was to remove the artificial 0.5 baseline bias. If the scores are too deflated in practice, the user can adjust the formula or the weight in _DEFAULT_WEIGHTS (now configurable via Fix #6). Keeping as-is per user's explicit request.
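To make the tradeoff concrete, here is a minimal sketch reproducing both formulas and the 5%-improvement example from the review (the function name and `base_offset` flag are illustrative, not the actual API):

```python
import math


def generalisation_score(rand_improve, block_improve, base_offset=False):
    # geo_mean: geometric mean of the two relative RMSE improvements
    geo_mean = math.sqrt(rand_improve * block_improve)
    if base_offset:
        return min(1.0, 0.5 + 0.5 * geo_mean)  # old formula, range [0.5, 1.0]
    return min(1.0, geo_mean)                   # new formula, range [0.0, 1.0]
```

With 5% improvement on both splits the new formula yields 0.05 where the old one gave 0.525, matching the numbers in the review; the configurable weights from Fix #6 are the intended lever if that deflation proves too strong in practice.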


Development

Successfully merging this pull request may close these issues.

GraphRAG アプリケーションのPythonファイルダウンロード機能
