fix(datalake): resolve column type by majority vote to prevent month-name surnames from becoming DATETIME #28093

Merged
edg956 merged 1 commit into main from fix/datalake-utils
May 13, 2026

edg956 commented May 13, 2026

Summary

  • _resolve_col_type now counts occurrences of each inferred type in the 1000-value sample and picks the most frequent type, using _TYPE_PRECEDENCE only to break ties.
  • Previously, a single dateutil-parseable token (e.g. the surname "May") was enough to flip an entire string column to DATETIME because the precedence list ranked datetime64[ns] above str unconditionally.
  • A new test_fetch_col_types_majority_wins subTest in TestDatalakeUtils covers: surnames with month-name tokens, pure ISO dates, natural-language date phrases, plain strings, and integer strings.
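The majority-vote rule described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the function name `resolve_col_type`, the contents and order of `TYPE_PRECEDENCE`, and the empty-sample fallback are all assumptions made for the example. Only the relative order of `datetime64[ns]` above `str` comes from the PR description.

```python
from collections import Counter

# Hypothetical precedence list, highest priority first. The real
# _TYPE_PRECEDENCE may hold different entries in a different order.
TYPE_PRECEDENCE = ["dict", "list", "datetime64[ns]", "float64", "int64", "str"]


def resolve_col_type(inferred_types):
    """Return the most frequent inferred type; precedence only breaks ties."""
    if not inferred_types:
        return "str"  # assumption: an empty sample falls back to string

    counts = Counter(inferred_types)
    top_count = max(counts.values())
    tied = [t for t, n in counts.items() if n == top_count]

    def rank(type_name):
        # Types missing from the precedence list sort last.
        if type_name in TYPE_PRECEDENCE:
            return TYPE_PRECEDENCE.index(type_name)
        return len(TYPE_PRECEDENCE)

    return min(tied, key=rank)


# 998 plain strings and 2 values that happened to parse as dates.
sample = ["str"] * 998 + ["datetime64[ns]"] * 2
print(resolve_col_type(sample))  # str
```

With an exact tie, for example one `str` and one `datetime64[ns]`, the sketch falls back to the precedence list and returns `datetime64[ns]`.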

Root cause

Commit 3d6fd71de3 replaced max(parsed_object_datatype_list) (lexicographic) with a precedence-based resolver. The lexicographic approach accidentally worked for string columns ("str" > "datetime64[ns]" lexicographically), but broke dict/list/numeric columns. The precedence-based resolver fixed those cases but introduced a regression: one ambiguous value now overrides hundreds of unambiguous ones.
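To make the two earlier behaviours concrete, here is a small sketch comparing them on the same sample. The helper names and the two-entry `PRECEDENCE` list are hypothetical and chosen only for illustration; the actual resolver in the codebase is not shown in this PR conversation.

```python
# Hypothetical two-entry precedence list, highest priority first.
PRECEDENCE = ["datetime64[ns]", "str"]


def resolve_lexicographic(inferred_types):
    """Old behaviour: plain max() over the type names as strings."""
    return max(inferred_types)


def resolve_by_precedence(inferred_types):
    """Precedence-only behaviour: any occurrence of a higher-ranked type wins."""
    return min(set(inferred_types), key=PRECEDENCE.index)


# 999 plain strings plus one value that parsed as a date.
inferred = ["str"] * 999 + ["datetime64[ns]"]

# "s" sorts after "d", so the string type wins purely by alphabet.
print(resolve_lexicographic(inferred))  # str

# One datetime-typed value outranks the 999 strings.
print(resolve_by_precedence(inferred))  # datetime64[ns]
```

Neither function looks at how often each type occurs, which is the gap the majority vote closes.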

Test plan

  • python -m pytest ingestion/tests/unit/utils/test_datalake.py::TestDatalakeUtils::test_fetch_col_types_majority_wins -v
  • Existing TestDatalakeUtils, TestFetchColTypesMixedTypes, TestFetchColTypesWithParsedObjects tests still pass

🤖 Generated with Claude Code

A single date-parseable token (e.g. the surname "May") was enough to
flip an entire string column to DATETIME because _TYPE_PRECEDENCE puts
datetime64[ns] above str. The fix counts occurrences of each inferred
type in the sample and picks the most frequent one, breaking ties with
_TYPE_PRECEDENCE. A column with hundreds of plain strings and a handful
of month-name values now correctly resolves to STRING.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
edg956 requested a review from a team as a code owner May 13, 2026 13:11
github-actions bot added the Ingestion and safe to test labels May 13, 2026
gitar-bot Bot commented May 13, 2026

Code Review ✅ Approved

Implements a frequency-based majority vote for column type inference to prevent ambiguous tokens like month-name surnames from incorrectly triggering DATETIME conversion. No issues found.

edg956 enabled auto-merge (squash) May 13, 2026 13:47
edg956 merged commit 89c6d21 into main May 13, 2026
73 of 103 checks passed
edg956 deleted the fix/datalake-utils branch May 13, 2026 16:01
@github-actions
Contributor

🟡 Playwright Results — all passed (14 flaky)

✅ 4073 passed · ❌ 0 failed · 🟡 14 flaky · ⏭️ 86 skipped

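| Shard | Passed | Failed | Flaky | Skipped |
|---|---|---|---|---|
| 🟡 Shard 1 | 298 | 0 | 1 | 4 |
| 🟡 Shard 2 | 759 | 0 | 8 | 8 |
| 🟡 Shard 3 | 782 | 0 | 2 | 7 |
| 🟡 Shard 4 | 789 | 0 | 1 | 18 |
| 🟡 Shard 5 | 708 | 0 | 1 | 41 |
| 🟡 Shard 6 | 737 | 0 | 1 | 8 |
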
🟡 14 flaky test(s) (passed on retry)
  • Pages/AuditLogs.spec.ts › should apply both User and EntityType filters simultaneously (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/ColumnBulkOperations.spec.ts › should filter by entity type (Table) (shard 2, 1 retry)
  • Features/DataQuality/BundleSuiteBulkOperations.spec.ts › Add test case to existing Bundle Suite (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 2 retries)
  • Features/KnowledgeCenter.spec.ts › Article mentions in description should working for Knowledge Center (shard 2, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Text Formatting (shard 2, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Text Formatting (shard 2, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Text Formatting (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/Workflows/WorkflowOssRestrictions.spec.ts › editing a form field and saving node config then workflow fires PUT API with updated data (shard 3, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Table (shard 4, 1 retry)
  • Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

@github-actions
Contributor

Failed to cherry-pick changes to the 1.13 branch.
Please cherry-pick the changes manually.
You can find more details here.

@github-actions
Contributor

Failed to cherry-pick changes to the 1.12.8 branch.
Please cherry-pick the changes manually.
You can find more details here.

edg956 added a commit that referenced this pull request May 13, 2026
…28093)
edg956 added a commit that referenced this pull request May 13, 2026
…28093)

Labels

  • Ingestion
  • safe to test: Add this label to run secure Github workflows on PRs
  • To release: Will cherry-pick this PR into the release branch
