feat: CSV format check (CSV書式確認) & explanatory/target variable selection (説明因子・目的変数の選択) + pipeline architecture fixes #137

minimumtone wants to merge 24 commits into main
Conversation
- Add _analyze_csv_format(): detects encoding, dtypes, missing values; displays a per-column analysis table with warnings
- Add _handle_generic_csv(): supports non-HEA CSVs by using user-selected numeric columns directly as features (bypasses HEA feature computation)
- Add "CSV書式確認" (CSV format check) accordion: auto-opens on upload showing the format report
- Add "説明因子・目的変数の選択" (explanatory/target variable selection) accordion: Dropdown for the target variable, CheckboxGroup for explanatory variables, Radio for CSV mode (auto/HEA/Generic)
- Update _handle_csv_upload() to accept selected_features and force_generic params
- Update run_experiment() to pass column role selections to the pipeline
- Auto-detect: numeric columns suggested as features, first numeric column as target
- Changing the target auto-excludes it from the feature list

Tested with both an HEA CSV (element columns) and a B2_result CSV (generic numeric).

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…hint

- _on_target_change now uses pd.api.types.is_numeric_dtype() to filter columns, preventing string/datetime columns from appearing in the feature list
- _analyze_csv_format docstring return type corrected to match the actual signature

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…n gr.State

- Fix default_target to use numeric_cols[-1] (last column) instead of [0]
- Store numeric_cols in gr.State during upload to avoid re-reading the CSV
- _on_target_change now uses the cached list for consistent dtype detection

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
When file_obj is None, return an empty [] for csv_numeric_cols_state to match the 7 declared outputs in the .change() binding.

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…banner Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…lumns Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…ation error Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…Windows compat

- 2.4 [Critical] Fix KeyError in CSV generic mode: override FeatureCatalog._SETS with CSV columns in app.py _run_in_thread(); skip missing FS in runner.py
- 2.1 [Critical] Update user_manual.md section 8.1: weights now match the implementation (0.30/0.20/0.30/-0.15/0.20/-0.10)
- 2.2 [Critical] Update 5-component → 6-component descriptions across user_manual.md; add an MC Penalty column to plotly_charts.py and report.py
- 2.3 [Critical] Fix FS_SIZE count (10→12) and FS_ALL (16→18) in docs
- 3.1 [Medium] Add a clarifying comment for the intentional Cr duplication in dataset.py element pools
- 3.2 [Medium] Remove the conditional check on the noise_std deprecation warning (always warn regardless of value)
- 3.3 [Medium] Conditional import of the resource module for Windows compatibility in runner.py with a _HAS_RESOURCE guard
- 4.1 [Minor] Add an OOD fold-0 limitation note to user_manual.md section 8.2
- 4.2 [Minor] Add MC Penalty to the section 5.4 report content table
- 4.3 [Minor] Add WF-LASSO/ARD/RF to the user_manual.md section 7.2 workflow table

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
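The 3.3 item above describes a guarded import of the POSIX-only resource module. A minimal sketch of that pattern (the helper name peak_rss_kb is illustrative, not from the PR):

```python
# Conditional import (item 3.3): the `resource` module does not exist on
# Windows, so probe for it once at import time.
try:
    import resource
    _HAS_RESOURCE = True
except ImportError:
    _HAS_RESOURCE = False


def peak_rss_kb():
    """Return peak resident set size, or None where `resource` is unavailable.

    Note: ru_maxrss is kilobytes on Linux but bytes on macOS.
    """
    if not _HAS_RESOURCE:
        return None
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
```

Every call site that previously used resource.getrusage() directly would check _HAS_RESOURCE (or call a wrapper like this) so the GUI keeps working on Windows.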
    warnings.warn(
        "noise_std is deprecated and ignored; strength noise is now "
        "proportional (7 % CV). This parameter will be removed in a "
        "future release.",
        DeprecationWarning,
        stacklevel=2,
    )
🟡 Deprecation warning fires unconditionally on every call, not just when parameter is explicitly set
The old code only emitted the DeprecationWarning for noise_std when the caller explicitly passed a non-default value (if noise_std != 50.0). The new code removes this guard and warns on every call to generate_hea_dataset(), even when the caller uses the default. Both call-sites in the codebase (hea_extrapolation_platform/__main__.py:99 and hea_extrapolation_platform/gui/app.py:2837) never pass noise_std, so they will now emit a spurious deprecation warning on every dataset generation, cluttering logs and misleading users into thinking they are using a deprecated API incorrectly.
Suggested change:

-    warnings.warn(
-        "noise_std is deprecated and ignored; strength noise is now "
-        "proportional (7 % CV). This parameter will be removed in a "
-        "future release.",
-        DeprecationWarning,
-        stacklevel=2,
-    )
+    if noise_std != 50.0:
+        warnings.warn(
+            "noise_std is deprecated and ignored; strength noise is now "
+            "proportional (7 % CV). This parameter will be removed in a "
+            "future release.",
+            DeprecationWarning,
+            stacklevel=2,
+        )
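The guarded form the review suggests can be exercised directly. A minimal runnable sketch (the dataset-generation body is elided; only the warning logic is shown):

```python
import warnings

_NOISE_STD_DEFAULT = 50.0


def generate_hea_dataset(noise_std: float = _NOISE_STD_DEFAULT):
    """Warn only when the caller explicitly overrides noise_std (sketch;
    actual dataset generation elided)."""
    if noise_std != _NOISE_STD_DEFAULT:
        warnings.warn(
            "noise_std is deprecated and ignored; strength noise is now "
            "proportional (7 % CV). This parameter will be removed in a "
            "future release.",
            DeprecationWarning,
            stacklevel=2,
        )
```

With this guard, the two call sites that never pass noise_std emit nothing, while callers who still set the parameter get exactly one DeprecationWarning.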
    _orig_sets = dict(_FC._SETS)
    try:
        csv_cols = list(features_df.columns)
        _FC._SETS = {k: csv_cols for k in _FC._SETS}
        r, s, o = runner.run(
            features_df, features_df, target,
            progress_callback=_progress_cb,
            selected_workflows=selected_wfs,
        )
    finally:
        _FC._SETS = _orig_sets
🔴 Thread-unsafe mutation of shared FeatureCatalog._SETS in generic CSV mode
FeatureCatalog._SETS is a class-level attribute shared across all threads and users. In generic CSV mode, _run_in_thread (gui/app.py:3033-3046) mutates this shared dict by replacing all entries with the current CSV's columns and restores it in a finally block. If two users run generic CSV experiments concurrently (this Gradio app explicitly supports multi-user sessions via gr.State and app.queue()), one user's override can be read by another user's runner, or one user's restore can clobber another user's override, leading to incorrect feature sets and potential KeyError crashes.
Race condition sequence:

1. Thread A overrides _FC._SETS with columns [a, b, c]
2. Thread B overrides _FC._SETS with columns [x, y, z]
3. Thread A's runner.run() reads _SETS → gets [x, y, z] instead of [a, b, c]
4. Thread A's finally restores _orig_sets (overwriting Thread B's override)
5. Thread B's runner.run() reads _SETS → gets the original HEA sets instead of [x, y, z]
Acknowledged — this is a valid concern for multi-user deployments. For the current single-user desktop use case this is not a blocker, but noted for future refactoring if concurrent access becomes a requirement.
Acknowledged. This is a valid concern for concurrent multi-user scenarios. However, the current deployment is single-user (local Gradio app), and _SETS mutation is scoped within try/finally with immediate restore. If multi-user support becomes a requirement, a thread-local or per-request copy pattern would be appropriate. Deferring this to a future PR unless the maintainer wants it addressed now.
Devin Review raised 3 items. Here's the status:
No code changes made — all items are either already addressed or are bot suggestions that don't apply to the current single-user use case.
…eature selection

- _analyze_csv_format now returns constant_cols (variance=0) as a 5th element
- _on_csv_upload auto-deselects constant columns (count_A, count_B, etc.) and ID-like columns (entry_id, index, etc.) from the default features
- Users can still manually select these columns if needed
- Feature choices still show all numeric columns except the target

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…ection

- Remove the hardcoded _id_like set (entry_id, index, etc.)
- Detect ID-like columns by statistical properties: integer dtype + all-unique values + monotonic ordering
- Also detect constant columns (variance=0) and all-NaN columns
- Fully generic: works with any CSV regardless of column naming

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
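The statistical ID detection this commit describes can be sketched as a small predicate (a minimal version; the real implementation also checks for constant and all-NaN columns):

```python
import pandas as pd


def is_id_like(s: pd.Series) -> bool:
    """Heuristic from the commit: a column looks like a row identifier when it
    has an integer dtype, all values are unique, and values are monotonically
    ordered (e.g. 0, 1, 2, ... row counters)."""
    return (
        pd.api.types.is_integer_dtype(s)
        and s.is_unique
        and bool(s.is_monotonic_increasing or s.is_monotonic_decreasing)
    )
```

Because the test is purely statistical, it works for any column name, which is the point of dropping the hardcoded _id_like set.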
…se3_train

When a feature set is skipped due to missing columns, jobs referencing that feature set remained in the list, causing a KeyError at fs_arrays[job.fs_name]. Jobs are now filtered to only those with successfully prepared feature sets.

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…ding

- Feature checkboxes now show ALL columns (numeric + string)
- String columns are annotated with a [文字列] (string) tag for easy identification
- String columns are auto-encoded via pd.get_dummies (one-hot encoding)
- Numeric columns are used as-is with median NaN imputation
- Updated UI labels and help text to reflect generic column selection
- The target variable dropdown remains numeric-only
- Fully generic: works with any CSV regardless of column types

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
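The encoding this commit describes (median imputation for numeric columns, pd.get_dummies for string columns) can be sketched as a self-contained helper; the real _handle_generic_csv wiring is elided and the function name here is illustrative:

```python
import pandas as pd


def encode_features(df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """Numeric columns kept as-is with median NaN imputation; non-numeric
    columns one-hot encoded via pd.get_dummies (sketch of the commit's
    described behaviour)."""
    sub = df[feature_cols]
    num = sub.select_dtypes(include="number")
    num = num.fillna(num.median())
    cat = sub.select_dtypes(exclude="number")
    if not cat.empty:
        num = pd.concat([num, pd.get_dummies(cat)], axis=1)
    return num
```

get_dummies prefixes the dummy columns with the source column name (e.g. element_A → element_A_Co), so the one-hot columns remain traceable to the original string feature.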
In generic CSV mode, FeatureCatalog.columns() returns HEA-specific column names (r_avg, delta_r, etc.) that don't exist in the CSV features_df. Now checks session['csv_mode'] and uses features_df.columns directly when in generic mode, falling back to FeatureCatalog only for HEA mode. Also adds safety filter for available columns. Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…radio errors All module-level refresh functions (_refresh_dashboard_data, _refresh_results_data, _refresh_ood_data, _refresh_data_summary, _refresh_report_data, _export_ood_csv) now catch exceptions internally and return safe fallback values instead of propagating to Gradio. This prevents the 'Error' toast that appears in the GUI when any post-experiment callback fails — errors are now logged to the terminal via logger.exception() while the UI gracefully degrades. Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
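The resilience pattern this commit applies to each refresh function can be expressed once as a decorator. This is a sketch of the idea (the PR wraps each function with an explicit try/except rather than a decorator):

```python
import functools
import logging

logger = logging.getLogger(__name__)


def safe_callback(fallback):
    """Log the full traceback via logger.exception() and return a safe
    fallback instead of letting the exception reach Gradio, which would
    otherwise show only a generic 'Error' toast."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                logger.exception("callback %s failed", fn.__name__)
                return fallback
        return wrapper
    return deco
```

Each callback's fallback must match its declared Gradio outputs (e.g. a tuple of gr.update() values and empty DataFrames), which is why the PR lists a per-function fallback shape.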
…neric CSV runs only FS_ALL

- workflows.py: WorkflowENS._make_member now accepts an n_features param (was hardcoded to 132)
- runner.py: ExperimentRunner.run() accepts selected_feature_sets to filter feature sets
- app.py: generic CSV mode passes selected_feature_sets=['FS_ALL'], avoiding redundant FS comparison

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…ing and user_manual.md TOC Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…y docstring enumeration Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…pipeline, remove StandardScaler+PCA from tree models, expand HPO grids

- Phase 0 dropped columns (constant/collinear) are now reflected in Phases 3 & 4
- Lasso feature selection integrated as Phase 0.5 (removes uninformative features)
- WF-XGB/WF-RF: removed StandardScaler+PCA (trees are scale-invariant; PCA destroys physical axes)
- WF-ENS: tree-based members no longer use scaler/PCA
- WF-XGB HPO grid expanded: +subsample, colsample_bytree, min_child_weight
- WF-RF HPO grid expanded: +min_samples_leaf, max_features

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
…ostingRegressor fallback Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
    _X_fs = features_all[_cols]
    _fs_summary = run_feature_selection(
        _X_fs, target,
        methods=["Lasso"],
        feature_set=_fs_key,
    )
    _lasso = _fs_summary.results.get("Lasso")
    if _lasso and _lasso.selected_features:
        _sel = _lasso.selected_features
        if len(_sel) >= 2:
            logger.info(
                "Feature selection [%s]: %d → %d features",
                _fs_key, len(_cols), len(_sel),
            )
            self._effective_cols[_fs_key] = _sel
🔴 Feature selection (Phase 0.5) uses full dataset including test data, causing data leakage
Phase 0.5 (runner.py:478-492) runs run_feature_selection on features_all and target, which is the entire dataset — not just training data. The Lasso model in _run_lasso (feature_selection.py:148-149) fits on all samples including future test samples from every cross-validation fold. The features selected this way are then stored in self._effective_cols and used for all Phase 3 training runs and Phase 4 OOD detection. This means feature selection has seen the test targets, invalidating the cross-validation evaluation. Test metrics (RMSE, R², etc.) and validity scores will be optimistically biased because the feature set was chosen with knowledge of test data. This undermines the platform's core purpose of comparing feature set validity and extrapolation capability.
Prompt for agents
In hea_extrapolation_platform/runner.py, the Phase 0.5 feature selection at lines 469-497 calls run_feature_selection(features_all[_cols], target, ...) using the entire dataset. This introduces data leakage because the Lasso model sees test targets. To fix this, either: (1) Remove Phase 0.5 entirely and rely on the learning algorithms' built-in regularization (Lasso, ARD workflows already do feature selection), or (2) Perform feature selection only on a held-out portion of the data that is never used for evaluation — e.g., split off a small portion before Phase 1 folds are computed, or use only the training fold from the first RandomCV split. Option (1) is the safest and simplest fix. If option (2) is chosen, ensure the feature selection split is computed before Phase 1 fold_plan and excluded from all subsequent folds.
Valid point about data leakage. Phase 0.5 currently runs run_feature_selection on features_all (entire dataset) before CV splitting, so the Lasso sees test targets.
Options to fix:
- Remove Phase 0.5 entirely — rely on built-in regularization (WF-LASSO, WF-ARD already do L1/ARD selection). Simplest and safest.
- Move feature selection inside Phase 3 — run Lasso on each training fold only. Most correct but increases computation (N_folds × N_seeds runs of Lasso).
- Use a held-out pre-split — reserve a small portion before fold computation for feature selection only.
Leaving this for the reviewer to decide which approach is preferred, as this involves a design tradeoff between computational cost and statistical rigor. The current implementation was requested by the user as "特徴量選択を学習本体に統合すること" — integrating feature selection into the pipeline.
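The fold-wise option above (run selection on each training fold only) hinges on one rule: the selector may only see training rows. A dependency-free sketch of that rule, using a simple |correlation| ranking in place of the pipeline's Lasso:

```python
import numpy as np


def select_features_per_fold(X, y, train_idx, k=3):
    """Leakage-free selection: rank features using ONLY the training rows of
    the current fold. The real pipeline would fit a Lasso here; an absolute
    correlation ranking keeps this sketch self-contained."""
    Xt, yt = X[train_idx], y[train_idx]
    scores = np.array([
        abs(np.corrcoef(Xt[:, j], yt)[0, 1]) for j in range(X.shape[1])
    ])
    return list(np.argsort(scores)[::-1][:k])
```

Because test-fold targets never enter the ranking, the selected feature set cannot be tuned to the evaluation data, at the cost of re-running selection per fold.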
HIGH priority (fixes 1-4):

- #1: Lazy workflow instantiation via lambda factories in _run_job
- #2: Phase 4 OOD multi-fold ensemble across all seeds
- #3: O(1) hash dict lookup in _collect_ood_errors (replaces O(n) np.array_equal)
- #4: AIC/BIC forward stepwise guard for n_features > 30

MEDIUM priority (fixes 5-9):

- #5: Consolidate safe_array/_safe_np into _utils.py (single source of truth)
- #6: Configurable weights for ValidityScore.total via FeatureValidityEvaluator(weights=...)
- #7: Fix generalisation score formula: min(1.0, geo_mean) instead of min(1.0, 0.5 + 0.5 * geo_mean)
- #8: Dynamic column names in RunRegistry.to_dataframe via dataclasses.fields()
- #9: Initialize OODDetector._actual_k in __init__ (not just fit())

LOW priority (fixes 10-12):

- #10: Unified mlflow/feast/mint interface (passing an object implies use)
- #11: Externalize _DELTA_H_BINARY to data/delta_h_binary.json
- #12: Add Phase 0.5 documentation to user_manual.md

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
🔴 NameError: wf_map referenced but variable was renamed to _BUILTIN_FACTORIES
In the _run_job function, the error message on line 263 references wf_map, but the variable was renamed to _BUILTIN_FACTORIES on line 241. When an unknown workflow name is encountered (not built-in and not in mint_configs), the raise KeyError will itself raise a NameError: name 'wf_map' is not defined, masking the original informative error message with a confusing traceback.
(Refers to line 263)
Fixed in 72abb15: wf_map → _BUILTIN_FACTORIES in the error message.
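The factory-dict pattern behind this fix can be sketched in a few lines (class and workflow names here are simplified stand-ins): workflows are built lazily on lookup, and the error message is derived from the same dict so it can never reference a stale variable name.

```python
class WorkflowRidge:
    name = "WF-RIDGE"


class WorkflowRF:
    name = "WF-RF"


# Lazy factories: nothing is instantiated until a workflow is selected.
_BUILTIN_FACTORIES = {
    "WF-RIDGE": lambda: WorkflowRidge(),
    "WF-RF": lambda: WorkflowRF(),
}


def get_workflow(name):
    try:
        return _BUILTIN_FACTORIES[name]()
    except KeyError:
        # Error text built from _BUILTIN_FACTORIES itself, so renaming the
        # dict cannot leave a dangling reference in the message.
        raise KeyError(
            f"Unknown workflow {name!r}; expected one of "
            f"{sorted(_BUILTIN_FACTORIES)}"
        ) from None
```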
hea_extrapolation_platform/runner.py
Outdated
    if len(fold_composites) > 1:
        avg_composite = np.mean(fold_composites, axis=0)
        avg_threshold = float(np.quantile(avg_composite, 0.95))
        is_ood_avg = avg_composite > avg_threshold
        n_ood = int(is_ood_avg.sum())
        ood_res = OODResult(
            mahalanobis_scores=primary_res.mahalanobis_scores,
            knn_scores=primary_res.knn_scores,
            composite_scores=avg_composite,
            is_ood=is_ood_avg,
            ood_threshold=avg_threshold,
            ood_ratio=n_ood / max(len(avg_composite), 1),
            n_total=len(avg_composite),
            n_ood=n_ood,
        )
🔴 OOD ensemble averages composite scores across different sample sets
In _phase4_ood, each seed's fold-0 produces a different test set (different sample indices due to different random shuffles). The code collects composite scores from each fold into fold_composites and averages them element-wise with np.mean(fold_composites, axis=0) at runner.py:883. However, position i in each array corresponds to a different sample in each fold, so the element-wise average is meaningless — it averages OOD scores of unrelated samples. The resulting avg_composite is then used to determine is_ood and stored with the primary fold's test indices (ood_test_idx from all_ood_folds[0]), producing incorrect OOD classifications. The _collect_ood_errors call at line 912-913 also uses ood_test_idx from fold 0, so the is_ood mask (computed from averaged scores of mixed samples) is applied to fold-0's test samples, yielding wrong OOD error evaluations.
Prompt for agents
In hea_extrapolation_platform/runner.py, the _phase4_ood method (lines 806-923) attempts to ensemble OOD scores across multiple RandomCV folds from different seeds, but each seed produces a different test set (different sample indices). Element-wise averaging of composite_scores arrays is therefore meaningless since position i in each array refers to a different sample.
Fix options:
1. Revert to single-fold OOD (use only fold 0 of seed 0, as was done before this PR).
2. If ensemble OOD is desired, map scores back to a per-sample array indexed by global sample index (e.g. create an array of shape (n_total_samples,) and accumulate scores per sample across folds, then average only where a sample was scored by multiple folds).
3. Alternatively, only use folds that share the same test indices (which won't happen with different seeds).
Good catch. Fixed in 72abb15: replaced element-wise averaging with per-sample accumulation using global score_sum/score_count arrays indexed by original sample indices. Each fold's scores are mapped back to global indices via score_sum[te_idx] += res.composite_scores, then averaged only where samples overlap (score_sum[primary_te] / score_count[primary_te]). This ensures the averaging is meaningful even when different seeds produce different test sets.
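The accumulation scheme described in this reply can be reproduced as a small standalone function (names simplified from the PR's score_sum/score_count description):

```python
import numpy as np


def ensemble_ood_scores(n_samples, folds):
    """Per-sample OOD score accumulation across folds with differing test
    sets. `folds` is a list of (test_idx, composite_scores) pairs. Scores are
    summed into global arrays indexed by original sample index and averaged
    only where a sample was scored at least once."""
    score_sum = np.zeros(n_samples)
    score_count = np.zeros(n_samples)
    for te_idx, scores in folds:
        score_sum[te_idx] += scores
        score_count[te_idx] += 1
    avg = np.full(n_samples, np.nan)
    scored = score_count > 0
    avg[scored] = score_sum[scored] / score_count[scored]
    return avg
```

Unlike element-wise averaging of per-fold arrays, position i here always refers to the same sample, so averaging is meaningful even when seeds produce disjoint or partially overlapping test sets.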
…cumulation

- Fix NameError: an error message referenced the deleted 'wf_map' variable; it now correctly references '_BUILTIN_FACTORIES'
- Fix OOD ensemble: different seeds produce different test sets, so element-wise averaging was meaningless. Now uses per-sample score accumulation via global arrays (score_sum/score_count) indexed by original sample indices, averaging only where samples overlap.

Co-Authored-By: satoshi minamoto <minimumtone@gmail.com>
    # (Review: min causes asymmetric improvements to be undervalued)
    geo_mean = math.sqrt(rand_improve * block_improve)
-   return min(1.0, 0.5 + 0.5 * geo_mean)
+   return min(1.0, geo_mean)
🔴 Generalisation score formula produces unreasonably low values after removing base offset
The old formula min(1.0, 0.5 + 0.5 * geo_mean) mapped realistic improvement ratios to the [0.5, 1.0] range. The new formula min(1.0, geo_mean) maps them to [0, 1.0] without any offset. Since geo_mean = sqrt(rand_improve * block_improve) where improvements are relative RMSE reductions (typically 0.01–0.20), the geometric mean is very small. For example, if both splits improve RMSE by 5%, geo_mean = sqrt(0.05 * 0.05) = 0.05 — giving a generalisation score of 0.05 instead of the old 0.525. Since generalisation has the highest weight (0.30), this deflates total validity scores for all non-baseline feature sets, making the ranking less discriminative. The docstring states "both splits improve -> 1" but achieving near-1.0 now requires ~100% RMSE reduction, which is unrealistic.
Suggested change:

-   return min(1.0, geo_mean)
+   return min(1.0, 0.5 + 0.5 * geo_mean)
This change was explicitly requested in the user's code review (Fix #7): min(1.0, geo_mean) instead of min(1.0, 0.5 + 0.5 * geo_mean). The bot's analysis is technically correct that typical RMSE improvements (1-20%) will produce low generalisation scores, but the user's intent was to remove the artificial 0.5 baseline bias. If the scores are too deflated in practice, the user can adjust the formula or the weight in _DEFAULT_WEIGHTS (now configurable via Fix #6). Keeping as-is per user's explicit request.
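The numeric effect both sides describe is easy to verify side by side (a minimal sketch of the two formulas, using the definitions from the review):

```python
import math


def gen_score_old(rand_improve, block_improve):
    """Pre-Fix-#7 formula: geometric mean mapped into [0.5, 1.0]."""
    geo = math.sqrt(rand_improve * block_improve)
    return min(1.0, 0.5 + 0.5 * geo)


def gen_score_new(rand_improve, block_improve):
    """Post-Fix-#7 formula: raw geometric mean, no 0.5 baseline."""
    return min(1.0, math.sqrt(rand_improve * block_improve))
```

With 5% RMSE improvement on both splits, the old formula yields 0.525 while the new one yields 0.05, which is exactly the deflation the bot flags and the baseline bias the user wanted removed; the configurable weights from Fix #6 are the intended knob if the deflation proves too strong in practice.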
feat: CSV format check (CSV書式確認) & explanatory/target variable selection (説明因子・目的変数の選択) + pipeline architecture fixes
Summary
Adds CSV format validation and interactive column role assignment to support diverse CSV formats beyond HEA composition data. Also addresses 4 fundamental pipeline architecture gaps: Phase 0→Phase 3 column propagation, feature selection integration, tree model preprocessing cleanup, and hyperparameter search expansion.
New functions:

- _analyze_csv_format(): analyzes the uploaded CSV structure (dtypes, missing values, constant columns) and renders an HTML report table
- _handle_generic_csv(): handles non-HEA CSVs by using user-selected columns directly as features (bypasses compute_features()). Supports both numeric and string columns; string columns are automatically one-hot encoded via pd.get_dummies.

New UI in Config & Run tab:

- String columns are annotated with a [文字列] (string) tag for easy identification

Modified:

- _handle_csv_upload() now accepts selected_features and force_generic params; auto-detects HEA vs Generic mode
- run_experiment() passes column role selections through to the pipeline
- The run_csv_target Textbox is replaced with a dynamic Dropdown populated from CSV columns
- _refresh_ood_data() and _export_ood_csv() now detect generic CSV mode and use features_df.columns directly instead of the HEA-specific FeatureCatalog.columns()

Updates since last revision
12 code review improvements (commit 158d67d)
Applied 12 code improvements based on a structured code review, prioritized by impact level.
HIGH priority (Fixes 1–4): Reproducibility & Performance

- Fix #1 (runner.py): _run_job() replaced upfront instantiation of all 6 workflows with lambda factories (_BUILTIN_FACTORIES). Workflows are now instantiated on demand, only when selected.
- Fix #2 (runner.py): Phase 4 OOD multi-fold ensemble via per-sample accumulation (score_sum[te_idx] += res.composite_scores), averaged only where samples overlap. Prevents bias from a single fold/seed.
- Fix #3 (runner.py): _collect_ood_errors() replaced the O(n) linear np.array_equal() scan with an O(1) hash dict lookup (test_indices.tobytes() as key).
- Fix #4 (feature_selection.py): AIC/BIC forward stepwise selection is skipped when n_features > 30, avoiding an O(n²) bottleneck on high-dimensional feature sets (e.g., FS_MAGPIE with 132 features).

MEDIUM priority (Fixes 5–9): Design & Maintainability

- Fix #5 (_utils.py (new), runner.py, workflows.py): safe_array()/_safe_np() merged into a new _utils.py module; a single source of truth for pandas → numpy C-contiguous conversion.
- Fix #6 (evaluation.py): weights moved to a _DEFAULT_WEIGHTS dict; FeatureValidityEvaluator(weights={...}) now accepts custom weights.
- Fix #7 (evaluation.py): generalisation score changed from min(1.0, 0.5 + 0.5 * geo_mean) to min(1.0, geo_mean) to remove the artificial 0.5 baseline bias.
- Fix #8 (runner.py): RunRegistry.to_dataframe() now uses dataclasses.fields(RunResult) instead of a hardcoded column list; non-scalar fields (ndarray, dict) are excluded automatically.
- Fix #9 (ood.py): _actual_k is now initialized in __init__() instead of only in fit(), preventing an AttributeError if score() is called before fit().

LOW priority (Fixes 10–12): Code Organization & Documentation

- Fix #10 (runner.py): passing a tracker object (mlflow_tracker=tracker) now automatically implies use_mlflow=True. Boolean flags are retained for backward compatibility but made redundant.
- Fix #11 (features.py, data/delta_h_binary.json (new)): the _DELTA_H_BINARY dict moved to a JSON file, lazy-loaded via _load_delta_h() on the first get_binary_enthalpy() call.
- Fix #12 (docs/user_manual.md): Phase 0.5 documentation added.

Bug fixes from Devin Review (commit 72abb15)
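Fix #3's lookup trick deserves a short illustration, since keying a dict by an ndarray is not directly possible (arrays are unhashable). A minimal sketch under assumed result shapes:

```python
import numpy as np


def index_by_test_indices(fold_results):
    """O(1) lookup (Fix #3): key each fold result by the raw byte content of
    its test-index array via tobytes(). Caveat: tobytes() ignores dtype and
    shape, so all keys must come from arrays produced the same way."""
    return {r["test_indices"].tobytes(): r for r in fold_results}
```

A linear scan with np.array_equal() over n folds becomes a single hash lookup, which matters once folds × seeds × feature sets multiply.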
NameError: wf_map → _BUILTIN_FACTORIES (runner.py:263). The error message referenced the deleted variable wf_map after Fix #1 renamed it; it now correctly references _BUILTIN_FACTORIES.

OOD ensemble averaging bug (runner.py:868-904). Fix #2 initially averaged composite scores element-wise across folds, but different seeds produce different test sets (different sample indices), so element-wise np.mean(fold_composites, axis=0) was meaningless: position i in each array corresponded to a different sample.

Fix: replaced with per-sample accumulation using global arrays (score_sum/score_count) indexed by original sample indices.
Pipeline architecture fixes (4 items)
User identified fundamental architectural gaps causing poor model performance. These are systematic fixes, not whack-a-mole patches.
Fix 1: Phase 0 dropped columns → Phase 3 training (runner.py)

Phase 0 multicollinearity diagnostics identified constant and perfectly-collinear columns (dropped_constant, dropped_perfect), but Phase 3 training still used the original FeatureCatalog.columns(); the drops were never reflected in the actual training arrays.

Fix: Added _effective_cols: Dict[str, List[str]] to ExperimentRunner. After Phase 0 completes, effective columns are computed by subtracting the dropped columns. Phase 3 array preparation and Phase 4 OOD detection both now use _effective_cols instead of the raw FeatureCatalog.columns(). Feature sets with missing columns are skipped with a warning instead of crashing.

Fix 2: Integrate run_feature_selection() into the pipeline (runner.py)

run_feature_selection() was implemented in feature_selection.py but never called from the main pipeline. High-dimensional/collinear features were passed directly to learning without any reduction.

Fix: Added "Phase 0.5" after the Phase 0 diagnostics. For each feature set with >3 features, Lasso-based feature selection runs and updates _effective_cols. In testing with oqmd_l12_compounds.csv, this reduced FS_ALL from 77 → 13 features. Failures are caught and logged (falls back to all cleaned features).

Fix 3: Remove StandardScaler+PCA from tree models (workflows.py)

Tree models (RF, XGB) are scale-invariant and don't need StandardScaler. PCA destroys physically meaningful feature axes and makes feature importance meaningless.

Fix: Removed ("scaler", StandardScaler()) and _make_pca_step() from WorkflowXGB.run() and WorkflowRF.run(). Modified WorkflowENS._make_member() to conditionally apply preprocessing: tree-based members (XGB, GBR) get model-only pipelines; linear members (Ridge) keep scaler+PCA. The _make_member() signature now accepts an n_features parameter for PCA estimation.

Fix 4: Expand hyperparameter search space (workflows.py)

XGB and RF grids were too shallow (3 params × 2-3 values each) for effective exploration.

Fix: XGB grid adds subsample: [0.8, 1.0], colsample_bytree: [0.8, 1.0], min_child_weight: [1, 5]; increased n_estimators: [100, 300] and max_depth: [3, 6]. RF grid adds min_samples_leaf: [1, 2], max_features: ["sqrt", 1.0]; increased n_estimators: [200, 500] and adjusted max_depth: [None, 15].

IMPORTANT: The XGB param grid now conditionally adds colsample_bytree and min_child_weight only when _XGB_AVAILABLE is True, preventing crashes when using the GradientBoostingRegressor fallback (which doesn't support these params).
User reported that after a long experiment (~16000 sec), the GUI showed "エラー" repeatedly but the terminal had no traceback. Root cause: Gradio internally catches exceptions in callbacks and displays a generic "Error" toast without logging to stderr.
Fix: All 6 post-experiment refresh/callback functions are now wrapped in
try/exceptwithlogger.exception():_refresh_dashboard_data("0", "--", "--", "--", ..., None, None)_refresh_results_datagr.update()× 3 + empty DataFrames + None plots_refresh_ood_dataNoneplot + error message + empty DataFrame_refresh_report_data_refresh_data_summary_export_ood_csvgr.update(value=None, visible=False)This ensures:
logger.exception()instead of being silently swallowedTested: Full 90-run experiment with
oqmd_l12_compounds.csv(143 rows × 11 cols, target=delta_e, all workflows). All tabs loaded successfully after completion with zero errors in GUI and zero exceptions in terminal.String/text column support as explanatory variables
User reported that element columns (e.g.
element_A,element_B) written in alphabetic text could not be selected as features. Now:[文字列]tag in the UIpd.get_dummies) before being passed to the ML pipelineKeyError fix:
_refresh_ood_data/_export_ood_csvin generic CSV modeAfter a successful 540-run experiment, the final tab refresh crashed with:
Root cause:
_refresh_ood_data()and_export_ood_csv()calledFeatureCatalog.columns()which returns HEA-specific column names, but in generic CSV modefeatures_dfhas the actual CSV columns (including one-hot encoded columns). TheFeatureCatalog._SETSmonkey-patch in_run_in_threadhad already been restored by thefinallyblock before these functions ran.Fix: Both functions now check
session["csv_mode"]— in"generic"mode they usefeatures_df.columnsdirectly. Addedavailable_colssafety filter to prevent KeyError even if some columns are missing.fix_instructions.docx — 10件の修正適用
Critical (重大)

2.4 KeyError fix for CSV generic mode (app.py, runner.py): _run_in_thread() now detects csv_mode == "generic" and temporarily overrides FeatureCatalog._SETS so all feature sets use the CSV columns instead of HEA-specific columns. The original _SETS is restored in a finally block. In runner.py, _phase3_train() now skips feature sets whose columns are missing from the data (logs a warning instead of crashing with KeyError).

2.1 Validity score weight documentation (user_manual.md §8.1): weights updated from (0.25, 0.20, 0.20, 0.15, 0.20) to (+0.30, +0.20, +0.30, −0.15, +0.20, −0.10) to match the evaluation.py implementation.

2.2 5 components → 6 components (user_manual.md, plotly_charts.py, report.py)

2.3 Feature set counts (user_manual.md §7.1): FS_SIZE 10 → 12 (added B_avg, Vm_avg to the listed features); FS_ALL 16 → 18.

Medium (中程度)

3.1 Cr duplication comment (dataset.py): clarifies that Cr intentionally appears in both _POOL_FCC and _POOL_BCC (used in both Cantor-type FCC and refractory BCC alloys).

3.2 noise_std deprecation warning (dataset.py): removed the if noise_std != 50.0: guard; the deprecation warning is now always emitted regardless of the parameter value.

3.3 Windows resource module compatibility (runner.py): try: import resource; _HAS_RESOURCE = True; except ImportError: _HAS_RESOURCE = False. resource.getrusage() calls are now guarded by if _HAS_RESOURCE: to prevent crashes on Windows.

Minor (軽微)

4.1 OOD fold-0 limitation note (user_manual.md §8.2)

4.2 MC Penalty column description (user_manual.md §5.4)

4.3 WF-LASSO/ARD/RF workflows (user_manual.md §7.2)
user_manual.md§7.2):Security & correctness fixes (Devin Review — prior revision)
- html_mod.escape() applied in the missing-value warning section of _analyze_csv_format().
- _handle_generic_csv() now defensively removes the target column from feature_cols (feature_cols = [c for c in feature_cols if c != target_col_clean]), with an error message if the resulting list is empty.
- The _on_csv_upload() error handler wraps traceback.format_exc() output in <pre> with html_mod.escape() to prevent broken HTML from angle brackets like <module>.

Earlier fixes (still in this PR)
- Default target now uses the last numeric column (numeric_cols[-1]) instead of the first, matching the convention that the target is typically the last column.
- _on_target_change caches numeric columns in gr.State: no re-reading of the CSV on target change, no dtype inconsistency from nrows=5.
- Numeric columns are filtered via pd.api.types.is_numeric_dtype().
- The _analyze_csv_format docstring return type was corrected to Tuple[pd.DataFrame, str, List[str], List[str]].

Other
- PR reference updated from PR#126 to PR#137.

GUI Testing
Test: oqmd_l12_compounds.csv pipeline architecture verification (143 rows × 11 columns):
- Constant column natoms (all values = 4) auto-deselected
- Target: delta_e; selected lattice_constant, stability, volume, element_A [文字列], element_B [文字列] as features
- "Effective columns [FS_ALL]: 77 features (after Phase 0 drops)" — dropped 2 perfect-collinear element columns
- "Feature selection [FS_ALL]: 77 → 13 features" — Lasso reduced the feature count
- No scaler or pca steps logged for WF-RF/WF-XGB (tree models)

Review & Testing Checklist for Human
Risk Level: 🔴 RED - Significant architectural changes + OOD ensemble algorithm change may impact reproducibility and model performance.
- OOD ensemble: per-sample accumulation (score_sum[te_idx] += res.composite_scores), then averaged only where samples were scored by multiple folds. Test with a multi-seed experiment and verify OOD flags are consistent across runs with the same data.
- Generalisation score: changed from min(1.0, 0.5 + 0.5 * geo_mean) to min(1.0, geo_mean). This will change evaluation results for all feature sets. Compare validity scores before/after on a reference dataset to understand the impact. Verify the new formula makes sense (no artificial baseline).
- Configurable weights: a _weights dict field was added to the ValidityScore dataclass. Verify this doesn't leak into DataFrame exports or JSON serialization. Check that to_dataframe() and to_dict() methods don't include _weights.
- Binary enthalpy data: _DELTA_H_BINARY moved to JSON. Verify the lazy loading works (_load_delta_h() called on the first get_binary_enthalpy()). Check that tuple-key conversion from comma-separated strings works correctly. Test with an HEA dataset to ensure dH_mix features are calculated correctly.
- Dynamic columns: dataclasses.fields() is used to dynamically generate column names. Verify to_dataframe() includes exactly the expected scalar fields (workflow, feature_set, split_policy, seed, fold, rmse_*, mae_*, r2_*, elapsed_sec) and excludes array/dict fields (y_test_true, y_test_pred, test_indices, params, artifacts, _weights).
- HEA regression check: run with HEA_ml_numeric_highconf.csv and check terminal logs.
- AIC/BIC guard: stepwise selection is skipped when n_features > 30. This is an arbitrary threshold. Test with datasets at the boundary (e.g., 29, 30, 31 features) to verify the skip logic works and doesn't cause issues.
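The dynamic-columns checklist item can be verified mechanically; a reduced sketch of the dataclasses.fields() filtering (RunResult here is a simplified stand-in, not the real dataclass):

```python
import dataclasses
import numpy as np


@dataclasses.dataclass
class RunResult:  # reduced stand-in for the real RunResult
    workflow: str
    rmse_test: float
    y_test_pred: np.ndarray
    params: dict


def scalar_columns(sample):
    """Keep only dataclass fields whose values are not ndarray/dict, as the
    described to_dataframe() filtering does (using one run as the probe)."""
    return [
        f.name for f in dataclasses.fields(sample)
        if not isinstance(getattr(sample, f.name), (np.ndarray, dict))
    ]
```

Probing a single instance is what makes the approach sensitive to heterogeneous runs, which is the caveat raised in the Notes section.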
- Launch the GUI: python -m hea_extrapolation_platform gui
- Run a generic CSV experiment (oqmd_l12_compounds.csv; after one-hot encoding it should have 70+ features)
- Verify the log shows "OOD FS_ALL: X/Y flagged (ensemble over N folds)" with N > 1
- Verify the log shows "Feature selection [FS_ALL]: M → K features" with K < M
- Verify the log shows "AIC skipped: N features > threshold 30" if applicable
- Run a multi-seed experiment (--seeds 42,123,456)
- Run an HEA-mode experiment with HEA_ml_numeric_highconf.csv (HEA mode)
- Verify dH_mix feature values are correct (verify binary enthalpy JSON loading)
- Check for the "Loaded 179 binary enthalpy entries from data/delta_h_binary.json" message
- Verify the _weights field is NOT present in exports and the results DataFrame has no _weights column
"Co,Cr") instead of tuple keys. The conversion happens at load time viatuple(k.split(",")). Verify this works correctly for all 179 entries.to_dataframe()method filters fields by checkingisinstance(getattr(self._runs[0], name, None), (np.ndarray, dict)). This assumes the first run is representative of all runs. If runs have heterogeneous field types (unlikely but possible), some columns may be incorrectly included/excluded.n_features > 30threshold is a heuristic. Datasets with 31-50 features may still benefit from AIC/BIC selection but will now skip it. Consider making this threshold configurable in a future PR._XGB_AVAILABLE is True. This prevents crashes onGradientBoostingRegressorfallback, but the fallback itself may have other parameter incompatibilities (e.g.,subsamplemay not work identically). Test on a system without XGBoost if possible.