feat: add column exclusion pipeline #181

Merged
marcellodebernardi merged 3 commits into main from codex/column-exclusion-pipeline
Mar 2, 2026

@marcellodebernardi

This PR adds a column exclusion pipeline: the data-understanding phase flags problematic columns, and the workflow removes them before downstream phases run. The aim is to keep leakage, identifier, and noisy columns out of split/sample/search/evaluation while preserving protected columns (targets, the ranking group column, and the primary input column). It also persists exclusions in checkpoints and extends EDA report submission to store structured metadata about problematic columns.
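For readers skimming the thread, the filtering rule described above can be sketched roughly as follows. This is an illustrative stand-alone example, not the code in `plexe/workflow.py`: the function names `select_columns_to_drop` and `apply_exclusions`, the list-of-dicts row format, and the sample column names are all assumptions made for the sketch.

```python
from typing import Iterable


def select_columns_to_drop(
    all_columns: Iterable[str],
    flagged: Iterable[str],
    protected: Iterable[str],
) -> list[str]:
    """Return flagged columns that exist in the dataset and are not protected.

    Hypothetical helper: order follows the flagged list, duplicates are dropped.
    """
    existing = set(all_columns)
    keep = set(protected)
    to_drop: list[str] = []
    for name in flagged:
        if name in existing and name not in keep and name not in to_drop:
            to_drop.append(name)
    return to_drop


def apply_exclusions(rows: list[dict], excluded: Iterable[str]) -> list[dict]:
    """Drop the excluded keys from every row (stand-in for a DataFrame drop)."""
    drop = set(excluded)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


# Sample data (made up for illustration).
columns = ["user_id", "age", "label", "label_copy", "query_id"]
flagged = ["user_id", "label_copy", "label", "missing_col"]
protected = ["label", "query_id"]  # e.g. target and ranking group column

excluded = select_columns_to_drop(columns, flagged, protected)
train_rows = [{"user_id": 1, "age": 30, "label": 0, "label_copy": 0, "query_id": "q1"}]
filtered_rows = apply_exclusions(train_rows, excluded)
```

In this sketch the same `excluded` list would be reused for a separate test dataset, so that training and test data end up with matching columns.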

Testing

  • poetry run pytest tests/unit/workflow/test_column_exclusion.py tests/unit/test_models.py (the workflow test file is skipped when pyspark is not installed in the environment)
  • poetry run pytest tests/unit/test_imports.py
  • poetry run ruff check plexe/agents/ml_task_analyser.py plexe/constants.py plexe/models.py plexe/tools/submission.py plexe/workflow.py tests/unit/workflow/test_column_exclusion.py
  • poetry run black --check plexe/agents/ml_task_analyser.py plexe/constants.py plexe/models.py plexe/tools/submission.py plexe/workflow.py tests/unit/workflow/test_column_exclusion.py

Copilot AI review requested due to automatic review settings February 27, 2026 15:18
@greptile-apps

greptile-apps bot commented Feb 27, 2026

Greptile Summary

Added column exclusion pipeline to prevent leakage, identifier, and noisy columns from entering downstream ML phases. The implementation correctly filters problematic columns flagged during data analysis while protecting targets, group columns, and primary input columns. Exclusions are persisted in checkpoints and applied to both training and test datasets. The previous review feedback regarding function length has been addressed through helper function extraction.

Confidence Score: 5/5

  • This PR is safe to merge with no identified issues
  • The implementation is well-structured with comprehensive tests covering core functionality and edge cases. Helper functions properly validate inputs, protect critical columns, and handle missing/invalid entries. Checkpoint persistence is correctly implemented, and the data flow from Phase 1 analysis through Phase 2 preparation is sound.
  • No files require special attention
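As a rough illustration of what handling missing or invalid entries could look like for the structured metadata (column, reason, category), consider the sketch below. It is not taken from the PR: the function name `normalise_problematic_columns`, the `"unknown"` fallback category, and the rule of skipping duplicates are assumptions chosen for the example.

```python
from typing import Any


def normalise_problematic_columns(raw: Any) -> list[dict]:
    """Keep only well-formed entries; hypothetical validation rules.

    An entry is kept when it is a dict with a non-empty string "column".
    Missing "reason" becomes "", missing "category" becomes "unknown".
    """
    if not isinstance(raw, list):
        return []
    cleaned: list[dict] = []
    seen: set[str] = set()
    for entry in raw:
        if not isinstance(entry, dict):
            continue  # skip entries that are not mappings
        column = entry.get("column")
        if not isinstance(column, str) or not column.strip():
            continue  # skip entries without a usable column name
        column = column.strip()
        if column in seen:
            continue  # keep the first mention of each column
        seen.add(column)
        cleaned.append(
            {
                "column": column,
                "reason": str(entry.get("reason") or ""),
                "category": str(entry.get("category") or "unknown"),
            }
        )
    return cleaned


# Sample payload (made up for illustration).
payload = [
    {"column": "user_id", "reason": "unique per row", "category": "identifier"},
    {"column": " label_copy ", "reason": "duplicates the target"},
    {"reason": "no column name given"},
    "not-a-dict",
    {"column": "user_id", "reason": "repeated entry", "category": "identifier"},
]
entries = normalise_problematic_columns(payload)
```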

Important Files Changed

| Filename | Overview |
| --- | --- |
| plexe/workflow.py | Added column exclusion pipeline with helper functions to filter problematic columns while protecting targets, group columns, and primary input. Applies exclusions to both training and test datasets. |
| plexe/models.py | Added excluded_columns field to BuildContext with proper checkpoint serialization/deserialization support. |
| plexe/tools/submission.py | Extended EDA report tool to accept and store problematic_columns parameter with structured metadata (column, reason, category). |
| tests/unit/workflow/test_column_exclusion.py | Comprehensive test suite covering column exclusion, no-op scenarios, protected column preservation, and checkpoint persistence. |

Last reviewed commit: f325112

@greptile-apps greptile-apps bot left a comment

9 files reviewed, 1 comment

Copilot AI left a comment

Pull request overview

Adds a “column exclusion” step to the ML workflow so that columns flagged as problematic during Phase 1 analysis are removed before downstream phases, while ensuring protected columns (targets/group/primary input) are retained. It also persists exclusions via BuildContext checkpointing and extends the EDA submission payload to include structured problematic-column metadata.

Changes:

  • Introduce _exclude_problematic_columns and wire it into Phase 1 (analysis) + Phase 2 (use filtered dataset URI; drop excluded columns from separate test datasets).
  • Extend EDA report submission to accept/store problematic_columns in scratch (and add ScratchKeys.PROBLEMATIC_COLUMNS).
  • Persist excluded_columns in BuildContext serialization + add unit tests for the exclusion behavior.
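To make the checkpoint round-trip in the last bullet concrete, here is a minimal sketch using a stand-in dataclass. It does not reproduce the real `BuildContext`: the class name `ContextSketch`, its `dataset_uri` field, the `to_dict`/`from_dict` method names, and the choice to default a missing `excluded_columns` key to an empty list are assumptions for illustration only.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ContextSketch:
    """Hypothetical stand-in for a build context that tracks exclusions."""

    dataset_uri: str
    excluded_columns: list[str] = field(default_factory=list)

    def to_dict(self) -> dict:
        # Copy the list so the checkpoint does not alias live state.
        return {
            "dataset_uri": self.dataset_uri,
            "excluded_columns": list(self.excluded_columns),
        }

    @classmethod
    def from_dict(cls, data: dict) -> "ContextSketch":
        # Older checkpoints may lack the key; fall back to no exclusions.
        return cls(
            dataset_uri=data["dataset_uri"],
            excluded_columns=list(data.get("excluded_columns", [])),
        )


# Round-trip through JSON, as a checkpoint file might.
original = ContextSketch("data/train.parquet", ["user_id", "label_copy"])
restored = ContextSketch.from_dict(json.loads(json.dumps(original.to_dict())))

# A checkpoint written before the field existed still loads.
legacy = ContextSketch.from_dict({"dataset_uri": "data/train.parquet"})
```

Defaulting the missing key keeps older checkpoints loadable in this sketch; whether the actual implementation does the same is not stated in this thread.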

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| plexe/workflow.py | Adds the column exclusion implementation and integrates filtered URIs into the workflow + applies exclusions to separate test datasets. |
| plexe/tools/submission.py | Extends save_eda_report to accept/store structured problematic columns and write them into scratch. |
| plexe/models.py | Adds excluded_columns to BuildContext and includes it in checkpoint serialization/deserialization. |
| plexe/constants.py | Adds ScratchKeys.PROBLEMATIC_COLUMNS. |
| plexe/agents/ml_task_analyser.py | Updates the agent prompt to request “Problematic Columns” output from analysis. |
| tests/unit/workflow/test_column_exclusion.py | Adds unit tests for exclusion behavior and checkpoint round-tripping. |
| tests/unit/workflow/__init__.py | Adds package marker for tests.unit.workflow. |
| tests/CODE_INDEX.md | Updates generated test code index with the new test module. |
| plexe/CODE_INDEX.md | Updates generated plexe code index timestamp/entries. |

@marcellodebernardi

@greptileai please review again with latest changes

@marcellodebernardi force-pushed the codex/column-exclusion-pipeline branch from 28b3d17 to 2b480ca on February 27, 2026 16:00
@marcellodebernardi

@greptileai please review again with latest changes

@marcellodebernardi force-pushed the codex/column-exclusion-pipeline branch from 2b480ca to f325112 on March 2, 2026 12:42
@marcellodebernardi

@greptileai please review again with latest changes

@marcellodebernardi merged commit 89f8083 into main on Mar 2, 2026
13 checks passed
@marcellodebernardi deleted the codex/column-exclusion-pipeline branch on March 2, 2026 13:21

2 participants