feat: add column exclusion pipeline #181
Greptile Summary
Added column exclusion pipeline to prevent leakage, identifier, and noisy columns from entering downstream ML phases. The implementation correctly filters problematic columns flagged during data analysis while protecting targets, group columns, and primary input columns. Exclusions are persisted in checkpoints and applied to both training and test datasets. The previous review feedback regarding function length has been addressed through helper function extraction.
Confidence Score: 5/5
| Filename | Overview |
|---|---|
| plexe/workflow.py | Added column exclusion pipeline with helper functions to filter problematic columns while protecting targets, group columns, and primary input. Applies exclusions to both training and test datasets. |
| plexe/models.py | Added excluded_columns field to BuildContext with proper checkpoint serialization/deserialization support. |
| plexe/tools/submission.py | Extended EDA report tool to accept and store problematic_columns parameter with structured metadata (column, reason, category). |
| tests/unit/workflow/test_column_exclusion.py | Comprehensive test suite covering column exclusion, no-op scenarios, protected column preservation, and checkpoint persistence. |
Last reviewed commit: f325112
Pull request overview
Adds a “column exclusion” step to the ML workflow so that columns flagged as problematic during Phase 1 analysis are removed before downstream phases, while ensuring protected columns (targets/group/primary input) are retained. It also persists exclusions via BuildContext checkpointing and extends the EDA submission payload to include structured problematic-column metadata.
Changes:
- Introduce `_exclude_problematic_columns` and wire it into Phase 1 (analysis) + Phase 2 (use filtered dataset URI; drop excluded columns from separate test datasets).
- Extend EDA report submission to accept/store `problematic_columns` in scratch (and add `ScratchKeys.PROBLEMATIC_COLUMNS`).
- Persist `excluded_columns` in `BuildContext` serialization + add unit tests for the exclusion behavior.
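The exclusion rule described above (drop flagged columns, never drop protected ones) can be sketched in plain Python. This is an illustration only, not the PR's `_exclude_problematic_columns`: the function name `split_columns`, the sample column names, and the use of plain lists instead of a Spark dataset are all assumptions. Only the entry shape (column, reason, category) comes from the review summary.

```python
from typing import Dict, Iterable, List, Tuple


def split_columns(
    all_columns: List[str],
    problematic: Iterable[Dict[str, str]],
    protected: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: return (kept, excluded) in original column order."""
    protected_set = set(protected)
    # Each flagged entry is assumed to carry "column", "reason", "category".
    flagged = {entry["column"] for entry in problematic}
    # A flagged column is excluded only if it exists and is not protected.
    excluded = [c for c in all_columns if c in flagged and c not in protected_set]
    excluded_set = set(excluded)
    kept = [c for c in all_columns if c not in excluded_set]
    return kept, excluded


# Made-up example data for illustration.
columns = ["customer_id", "age", "churn_date", "churned"]
problematic = [
    {"column": "customer_id", "reason": "unique per row", "category": "identifier"},
    {"column": "churn_date", "reason": "known only after outcome", "category": "leakage"},
    {"column": "churned", "reason": "flagged by mistake", "category": "leakage"},
    {"column": "not_in_data", "reason": "stale flag", "category": "noisy"},
]
kept, excluded = split_columns(columns, problematic, protected=["churned"])
```

In this sketch the target `churned` survives even though it was flagged, and a flag for a column that is not in the dataset is a no-op, which mirrors the protected-column and no-op scenarios the unit tests are said to cover.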
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| plexe/workflow.py | Adds the column exclusion implementation and integrates filtered URIs into the workflow + applies exclusions to separate test datasets. |
| plexe/tools/submission.py | Extends save_eda_report to accept/store structured problematic columns and write them into scratch. |
| plexe/models.py | Adds excluded_columns to BuildContext and includes it in checkpoint serialization/deserialization. |
| plexe/constants.py | Adds ScratchKeys.PROBLEMATIC_COLUMNS. |
| plexe/agents/ml_task_analyser.py | Updates the agent prompt to request “Problematic Columns” output from analysis. |
| tests/unit/workflow/test_column_exclusion.py | Adds unit tests for exclusion behavior and checkpoint round-tripping. |
| tests/unit/workflow/__init__.py | Adds package marker for tests.unit.workflow. |
| tests/CODE_INDEX.md | Updates generated test code index with the new test module. |
| plexe/CODE_INDEX.md | Updates generated plexe code index timestamp/entries. |
@greptileai please review again with latest changes
28b3d17 to 2b480ca
@greptileai please review again with latest changes
2b480ca to f325112
@greptileai please review again with latest changes
This PR adds a column exclusion pipeline that allows data understanding to flag problematic columns and removes them before downstream phases. The aim is to prevent leakage, identifier, and noisy columns from entering split/sample/search/evaluation while preserving protected columns (targets, ranking group column, and primary input column). It also persists exclusions in checkpoints and extends EDA report submission to store structured problematic column metadata.
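As a rough sketch of the checkpoint persistence mentioned above, the snippet below round-trips an exclusion list through JSON and reapplies it to a separate test dataset. `SketchContext`, `to_dict`, `from_dict`, and `drop_excluded` are hypothetical stand-ins, not the actual `BuildContext` API. Defaulting to an empty list for checkpoints that lack the key is likewise an assumption about how backward compatibility might be handled.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SketchContext:
    """Hypothetical stand-in for BuildContext; only the field of interest."""

    excluded_columns: List[str] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        return {"excluded_columns": list(self.excluded_columns)}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SketchContext":
        # Assumption: checkpoints written before this field existed load as [].
        return cls(excluded_columns=list(data.get("excluded_columns", [])))


def drop_excluded(rows: List[Dict[str, Any]], excluded: List[str]) -> List[Dict[str, Any]]:
    """Remove excluded keys from each row; columns absent from a row are ignored."""
    excluded_set = set(excluded)
    return [{k: v for k, v in row.items() if k not in excluded_set} for row in rows]


ctx = SketchContext(excluded_columns=["customer_id", "churn_date"])
restored = SketchContext.from_dict(json.loads(json.dumps(ctx.to_dict())))

# Made-up test rows; the same exclusions are applied after a resume.
test_rows = [{"customer_id": 7, "age": 41, "churned": 0}]
filtered_test_rows = drop_excluded(test_rows, restored.excluded_columns)
```

Storing the exclusions on the context means a resumed run can filter a separate test dataset with the same column list that was used for training, without repeating the analysis phase.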
Testing
- `poetry run pytest tests/unit/workflow/test_column_exclusion.py tests/unit/test_models.py` (the workflow test file is skipped when `pyspark` is not installed in the environment)
- `poetry run pytest tests/unit/test_imports.py`
- `poetry run ruff check plexe/agents/ml_task_analyser.py plexe/constants.py plexe/models.py plexe/tools/submission.py plexe/workflow.py tests/unit/workflow/test_column_exclusion.py`
- `poetry run black --check plexe/agents/ml_task_analyser.py plexe/constants.py plexe/models.py plexe/tools/submission.py plexe/workflow.py tests/unit/workflow/test_column_exclusion.py`