feat: add column exclusion pipeline by marcellodebernardi · Pull Request #181 · plexe-ai/plexe

marcellodebernardi · 2026-02-27T15:18:57Z

This PR adds a column exclusion pipeline that allows data understanding to flag problematic columns and removes them before downstream phases. The aim is to prevent leakage, identifier, and noisy columns from entering split/sample/search/evaluation while preserving protected columns (targets, ranking group column, and primary input column). It also persists exclusions in checkpoints and extends EDA report submission to store structured problematic column metadata.
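
The filtering rule described above can be sketched in a few lines. This is a minimal illustration and not the code in `plexe/workflow.py`: the helper name `select_columns_to_exclude` and its parameters are hypothetical, and it works on plain lists of column names rather than on a dataset.

```python
from typing import Iterable


def select_columns_to_exclude(
    flagged: Iterable[str],
    available: Iterable[str],
    protected: Iterable[str],
) -> list[str]:
    """Return flagged columns that exist and are not protected (hypothetical helper)."""
    available_set = set(available)
    protected_set = set(protected)
    excluded: list[str] = []
    for name in flagged:
        # Skip unknown names, protected columns, and duplicates.
        if name in available_set and name not in protected_set and name not in excluded:
            excluded.append(name)
    return excluded


columns = ["user_id", "age", "label", "label_copy", "query_id"]
flagged = ["user_id", "label_copy", "label", "missing_col"]
protected = ["label", "query_id"]  # e.g. target and ranking group column

excluded = select_columns_to_exclude(flagged, columns, protected)
kept = [c for c in columns if c not in excluded]
print(excluded)  # ['user_id', 'label_copy']
print(kept)      # ['age', 'label', 'query_id']
```

Here `label` is flagged but survives because it is protected, and `missing_col` is ignored because it is not present in the dataset.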

Testing

poetry run pytest tests/unit/workflow/test_column_exclusion.py tests/unit/test_models.py (the workflow test file is skipped when pyspark is not installed in the environment)
poetry run pytest tests/unit/test_imports.py
poetry run ruff check plexe/agents/ml_task_analyser.py plexe/constants.py plexe/models.py plexe/tools/submission.py plexe/workflow.py tests/unit/workflow/test_column_exclusion.py
poetry run black --check plexe/agents/ml_task_analyser.py plexe/constants.py plexe/models.py plexe/tools/submission.py plexe/workflow.py tests/unit/workflow/test_column_exclusion.py
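
The skip-when-pyspark-is-missing behaviour noted above can be reproduced in several ways; how the actual test file does it is not shown in this PR page. The sketch below uses only the standard library, and the class and test names are hypothetical. In a pytest suite, `pytest.importorskip("pyspark")` at module level is a common alternative.

```python
import importlib.util
import unittest

# True only when pyspark can be imported in the current environment.
HAS_PYSPARK = importlib.util.find_spec("pyspark") is not None


@unittest.skipUnless(HAS_PYSPARK, "pyspark is not installed")
class ColumnExclusionWorkflowSketch(unittest.TestCase):
    """Hypothetical test class; skipped as a whole when pyspark is unavailable."""

    def test_pyspark_is_importable(self):
        import pyspark  # imported lazily so collection never fails

        self.assertTrue(hasattr(pyspark, "__version__"))
```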

greptile-apps · 2026-02-27T15:22:59Z

Greptile Summary

Added column exclusion pipeline to prevent leakage, identifier, and noisy columns from entering downstream ML phases. The implementation correctly filters problematic columns flagged during data analysis while protecting targets, group columns, and primary input columns. Exclusions are persisted in checkpoints and applied to both training and test datasets. The previous review feedback regarding function length has been addressed through helper function extraction.

Confidence Score: 5/5

This PR is safe to merge with no identified issues
The implementation is well-structured with comprehensive tests covering core functionality and edge cases. Helper functions properly validate inputs, protect critical columns, and handle missing/invalid entries. Checkpoint persistence is correctly implemented, and the data flow from Phase 1 analysis through Phase 2 preparation is sound.
No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| `plexe/workflow.py` | Added column exclusion pipeline with helper functions to filter problematic columns while protecting targets, group columns, and primary input. Applies exclusions to both training and test datasets. |
| `plexe/models.py` | Added `excluded_columns` field to BuildContext with proper checkpoint serialization/deserialization support. |
| `plexe/tools/submission.py` | Extended EDA report tool to accept and store `problematic_columns` parameter with structured metadata (column, reason, category). |
| `tests/unit/workflow/test_column_exclusion.py` | Comprehensive test suite covering column exclusion, no-op scenarios, protected column preservation, and checkpoint persistence. |
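
To make the structured metadata concrete, the sketch below shows one way entries with `column`, `reason`, and `category` fields could be validated before being stored. It is an assumption-laden illustration: the function name `normalise_problematic_columns`, the drop-malformed-entries policy, and the example category strings are hypothetical and are not taken from `plexe/tools/submission.py`.

```python
from typing import Any

REQUIRED_KEYS = ("column", "reason", "category")


def normalise_problematic_columns(raw: Any) -> list[dict[str, str]]:
    """Keep only well-formed entries; drop anything missing a required string field."""
    if not isinstance(raw, list):
        return []
    cleaned: list[dict[str, str]] = []
    for entry in raw:
        if not isinstance(entry, dict):
            continue
        # Every required key must map to a non-empty string.
        if not all(isinstance(entry.get(k), str) and entry[k].strip() for k in REQUIRED_KEYS):
            continue
        cleaned.append({k: entry[k].strip() for k in REQUIRED_KEYS})
    return cleaned


report = [
    {"column": "user_id", "reason": "unique per row", "category": "identifier"},
    {"column": "label_copy", "reason": "duplicates the target"},  # missing category
    "not-a-dict",
]
cleaned_report = normalise_problematic_columns(report)
print(cleaned_report)
```

Only the first entry survives; the other two are dropped rather than raising, which keeps a malformed agent response from aborting the run.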

Last reviewed commit: f325112

greptile-apps

9 files reviewed, 1 comment


plexe/workflow.py

Copilot

Pull request overview

Adds a “column exclusion” step to the ML workflow so that columns flagged as problematic during Phase 1 analysis are removed before downstream phases, while ensuring protected columns (targets/group/primary input) are retained. It also persists exclusions via BuildContext checkpointing and extends the EDA submission payload to include structured problematic-column metadata.

Changes:

Introduce _exclude_problematic_columns and wire it into Phase 1 (analysis) + Phase 2 (use filtered dataset URI; drop excluded columns from separate test datasets).
Extend EDA report submission to accept/store problematic_columns in scratch (and add ScratchKeys.PROBLEMATIC_COLUMNS).
Persist excluded_columns in BuildContext serialization + add unit tests for the exclusion behavior.
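
Persisting the exclusion list in a checkpoint amounts to a serialization round trip. The sketch below illustrates the idea with a hypothetical `ContextSketch` dataclass and JSON; the real `BuildContext` in `plexe/models.py` has other fields and may serialize differently. Defaulting to an empty list on load is also an assumption, made so that checkpoints written before the field existed still load.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ContextSketch:
    """Hypothetical stand-in for BuildContext; only the fields needed here."""

    dataset_uri: str
    excluded_columns: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "ContextSketch":
        data = json.loads(payload)
        # Older checkpoints may predate the field, so fall back to an empty list.
        return cls(
            dataset_uri=data["dataset_uri"],
            excluded_columns=list(data.get("excluded_columns", [])),
        )


original = ContextSketch("data/train.parquet", ["user_id", "label_copy"])
restored = ContextSketch.from_json(original.to_json())
legacy = ContextSketch.from_json('{"dataset_uri": "data/train.parquet"}')
print(restored.excluded_columns, legacy.excluded_columns)
```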

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `plexe/workflow.py` | Adds the column exclusion implementation and integrates filtered URIs into the workflow + applies exclusions to separate test datasets. |
| `plexe/tools/submission.py` | Extends `save_eda_report` to accept/store structured problematic columns and write them into scratch. |
| `plexe/models.py` | Adds `excluded_columns` to `BuildContext` and includes it in checkpoint serialization/deserialization. |
| `plexe/constants.py` | Adds `ScratchKeys.PROBLEMATIC_COLUMNS`. |
| `plexe/agents/ml_task_analyser.py` | Updates the agent prompt to request “Problematic Columns” output from analysis. |
| `tests/unit/workflow/test_column_exclusion.py` | Adds unit tests for exclusion behavior and checkpoint round-tripping. |
| `tests/unit/workflow/__init__.py` | Adds package marker for `tests.unit.workflow`. |
| `tests/CODE_INDEX.md` | Updates generated test code index with the new test module. |
| `plexe/CODE_INDEX.md` | Updates generated plexe code index timestamp/entries. |
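
Applying the same exclusions to a separate test dataset keeps the train and test schemas aligned. The following is a rough sketch of that idea using lists of dictionaries as stand-ins for datasets; the helper name `drop_columns` is hypothetical, and the actual workflow operates on dataset URIs rather than in-memory rows.

```python
def drop_columns(rows: list[dict], excluded: list[str]) -> list[dict]:
    """Return copies of the rows without the excluded keys (hypothetical helper)."""
    excluded_set = set(excluded)
    # Keys that are absent from a row are simply ignored.
    return [{k: v for k, v in row.items() if k not in excluded_set} for row in rows]


excluded_columns = ["user_id"]
train_rows = [{"user_id": 1, "age": 30, "label": 0}]
test_rows = [{"user_id": 2, "age": 41, "label": 1}]

train_filtered = drop_columns(train_rows, excluded_columns)
test_filtered = drop_columns(test_rows, excluded_columns)

# Both splits now expose the same set of columns.
print(sorted(train_filtered[0]) == sorted(test_filtered[0]))
```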


tests/unit/workflow/test_column_exclusion.py

plexe/workflow.py

marcellodebernardi · 2026-02-27T15:33:00Z

@greptileai please review again with latest changes

marcellodebernardi · 2026-02-27T16:01:00Z

@greptileai please review again with latest changes

marcellodebernardi · 2026-03-02T12:48:44Z

@greptileai please review again with latest changes

Copilot AI review requested due to automatic review settings February 27, 2026 15:18

Copilot started reviewing on behalf of marcellodebernardi February 27, 2026 15:19

greptile-apps bot reviewed Feb 27, 2026

plexe/workflow.py (outdated, resolved)

Copilot AI reviewed Feb 27, 2026

tests/unit/workflow/test_column_exclusion.py (resolved)

plexe/workflow.py (outdated, resolved)

plexe/workflow.py (outdated, resolved)

plexe/workflow.py (outdated, resolved)

marcellodebernardi force-pushed the codex/column-exclusion-pipeline branch from 28b3d17 to 2b480ca February 27, 2026 16:00

marcellodebernardi added 3 commits March 2, 2026 12:40

feat: add column exclusion pipeline

812a192

fix: address review feedback

3d33d94

chore: bump version to 1.3.4

f325112

marcellodebernardi force-pushed the codex/column-exclusion-pipeline branch from 2b480ca to f325112 March 2, 2026 12:42

marcellodebernardi merged commit 89f8083 into main Mar 2, 2026
13 checks passed

marcellodebernardi deleted the codex/column-exclusion-pipeline branch March 2, 2026 13:21
