fix(datalake): resolve column type by majority vote to prevent month-name surnames from becoming DATETIME #28093
Merged
A single date-parseable token (e.g. the surname "May") was enough to flip an entire string column to DATETIME because _TYPE_PRECEDENCE puts datetime64[ns] above str. The fix counts occurrences of each inferred type in the sample and picks the most frequent one, breaking ties with _TYPE_PRECEDENCE. A column with hundreds of plain strings and a handful of month-name values now correctly resolves to STRING. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
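The majority-vote rule described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the function name `resolve_col_type_sketch`, the `default` parameter, and the contents of `TYPE_PRECEDENCE` are assumptions made for the example; the only ordering taken from the description is that `datetime64[ns]` ranks above `str`.

```python
from collections import Counter

# Hypothetical precedence list, highest priority first. The real
# _TYPE_PRECEDENCE may contain different entries; the only fact taken
# from the PR is that "datetime64[ns]" ranks above "str".
TYPE_PRECEDENCE = ["datetime64[ns]", "float64", "int64", "str"]


def resolve_col_type_sketch(inferred_types, default="str"):
    """Return the most frequent inferred type; precedence only breaks ties."""
    counts = Counter(inferred_types)
    if not counts:
        # Assumption: an empty sample falls back to a string type.
        return default

    top = max(counts.values())
    tied = [name for name, n in counts.items() if n == top]

    def rank(type_name):
        # Types missing from the list sort after every listed type.
        if type_name in TYPE_PRECEDENCE:
            return TYPE_PRECEDENCE.index(type_name)
        return len(TYPE_PRECEDENCE)

    return min(tied, key=rank)


# 300 plain surnames plus three month-name surnames ("May", "April", "June")
surname_sample = ["str"] * 300 + ["datetime64[ns]"] * 3
print(resolve_col_type_sketch(surname_sample))  # str
```

Precedence still matters in this sketch, but only when two types are equally frequent: a sample with one `str` and one `datetime64[ns]` value resolves to `datetime64[ns]`.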
Code Review ✅ Approved: Implements a frequency-based majority vote for column type inference to prevent ambiguous tokens like month-name surnames from incorrectly triggering DATETIME conversion. No issues found.
mohittilala approved these changes on May 13, 2026
🟡 Playwright Results: all passed (14 flaky)
✅ 4073 passed · ❌ 0 failed · 🟡 14 flaky · ⏭️ 86 skipped
The 14 flaky test(s) passed on retry.
How to debug locally:
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip # view trace
Failed to cherry-pick changes to the 1.13 branch.
Failed to cherry-pick changes to the 1.12.8 branch.
edg956 added two commits that referenced this pull request on May 13, 2026, each with the message:
…28093) A single date-parseable token (e.g. the surname "May") was enough to flip an entire string column to DATETIME because _TYPE_PRECEDENCE puts datetime64[ns] above str. The fix counts occurrences of each inferred type in the sample and picks the most frequent one, breaking ties with _TYPE_PRECEDENCE. A column with hundreds of plain strings and a handful of month-name values now correctly resolves to STRING. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>



Summary
- `_resolve_col_type` now counts occurrences of each inferred type in the 1000-value sample and picks the most frequent type, using `_TYPE_PRECEDENCE` only to break ties.
- A single `dateutil`-parseable token (e.g. the surname `"May"`) was enough to flip an entire string column to `DATETIME` because the precedence list ranked `datetime64[ns]` above `str` unconditionally.
- `test_fetch_col_types_majority_wins` subTest in `TestDatalakeUtils` covers: surnames with month-name tokens, pure ISO dates, natural-language date phrases, plain strings, and integer strings.

Root cause

Commit `3d6fd71de3` replaced `max(parsed_object_datatype_list)` (lexicographic) with a precedence-based resolver. The lexicographic approach accidentally worked for string columns (`"str" > "datetime64[ns]"` lexicographically), but broke dict/list/numeric columns. The new precedence fixed those cases but introduced a regression: one ambiguous value now overrides hundreds of unambiguous ones.

Test plan

- `python -m pytest ingestion/tests/unit/utils/test_datalake.py::TestDatalakeUtils::test_fetch_col_types_majority_wins -v`
- `TestDatalakeUtils`, `TestFetchColTypesMixedTypes`, and `TestFetchColTypesWithParsedObjects` tests still pass

🤖 Generated with Claude Code
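
To make the root cause concrete, the two earlier strategies can be compared on the same sample. This is an illustrative sketch only: `lexicographic_resolver`, `precedence_only_resolver`, and the two-entry `PRECEDENCE` list are hypothetical stand-ins, not the functions from commit `3d6fd71de3` or its predecessor.

```python
# Hypothetical two-entry precedence, highest priority first; the real
# _TYPE_PRECEDENCE is longer. Only the datetime-above-str ordering is
# taken from the PR description.
PRECEDENCE = ["datetime64[ns]", "str"]


def lexicographic_resolver(types):
    # Original behaviour as described: max() compares type names as
    # plain strings, so the winner is simply the alphabetically last.
    return max(types)


def precedence_only_resolver(types):
    # Behaviour after the precedence change as described: the
    # highest-ranked type present wins, however rare it is.
    return min(set(types), key=PRECEDENCE.index)


# 500 plain strings and a single date-parseable value such as "May"
sample = ["str"] * 500 + ["datetime64[ns]"]

print(lexicographic_resolver(sample))    # str
print(precedence_only_resolver(sample))  # datetime64[ns]
```

The lexicographic result is right here only because `"s"` sorts after `"d"`. Alphabetical order has no connection to type semantics, which is consistent with the PR's note that it misbehaved for other column kinds. The precedence-only result shows the regression: one value out of 501 decides the column type.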