
fix: StringDType / <U interop and left join null handling (v2.2.1) #25

Merged
dsfulf merged 16 commits into main from bugfix/string-dtype-interop
Mar 25, 2026

Conversation

@dsfulf
Member

@dsfulf dsfulf commented Mar 24, 2026

Summary

  • Join dtype validation: compare base types (_reduce_dtype) instead of raw numpy dtypes — StringDType vs <U, <U8 vs <U12 no longer reject
  • Union dtype validation: same base-type comparison
  • update_dtypes_inplace: preserve raw numpy dtypes for casting instead of round-tripping through formatted string labels — fixes silent no-op when converting between <U and StringDType
  • Left join null-fill: string columns use StringDType(na_object=None) instead of object, float columns use NaN instead of object, int/bool fall back to object with a warning
  • Version bump to 2.2.1
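
The base-type comparison in the first two bullets can be sketched in plain NumPy; `reduce_dtype` below is an illustrative stand-in for tafra's `_reduce_dtype`, not its actual implementation:

```python
import numpy as np

def reduce_dtype(dtype: np.dtype) -> str:
    # Collapse all string variants (fixed-width '<U', and variable-width
    # StringDType, which reports kind 'T' on NumPy >= 2.0) to one 'str' label.
    if dtype.kind in ('U', 'T'):
        return 'str'
    return dtype.name

a = np.array(['alpha', 'beta'])          # dtype '<U5'
b = np.array(['a-much-longer-string'])   # dtype '<U20'
assert a.dtype != b.dtype                              # raw dtypes differ...
assert reduce_dtype(a.dtype) == reduce_dtype(b.dtype)  # ...base types match
```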

Motivation

pyodbc returns Python strings as object arrays, which tafra's ObjectFormatter converts to StringDType. Meanwhile, operations like .upper() on string columns produce fixed-width <U arrays. Joining these two tafras raised TypeError: dtype StringDType() does not match dtype <U12.

Test plan

  • 108 tests pass (3 new: test_mixed_string_dtypes, test_update_dtypes_string_conversion, test_left_join_object_fallback_warning)
  • Verified: inner join, left join (both directions), union, concat with mixed StringDType/<U/<U widths
  • Verified: left join null-fill preserves string dtype, uses NaN for float, warns on int→object
  • Verified: update_dtypes_inplace int→float, float→int conversions still work

🤖 Generated with Claude Code

dsfulf and others added 10 commits March 24, 2026 17:56
- Join _validate_dtypes: compare base types via _reduce_dtype instead of
  raw numpy dtypes, so StringDType vs <U and <U8 vs <U12 are accepted
- Union.apply: same base-type comparison for column dtype validation
- update_dtypes_inplace: preserve raw numpy dtypes for casting instead of
  round-tripping through _format_dtype string labels (fixes silent no-op
  when both <U and StringDType map to 'str')
- Add test_mixed_string_dtypes covering all interop scenarios
- Bump version to 2.2.1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- String columns use StringDType(na_object=None) instead of object
- Float columns use NaN instead of object
- Int/bool columns still fall back to object (can't represent None)
- Both C-path and fallback path updated
- Tests updated for new null-fill behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Warns once per column when an int/bool column has unmatched rows and
must fall back to object dtype. String and float columns are handled
natively (StringDType(na_object=None) and NaN respectively) and do
not warn.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ning

- test_update_dtypes_string_conversion: verifies <U/StringDType handling,
  int->float, float->int conversions
- test_left_join_object_fallback_warning: asserts warning fires for int
  columns, does NOT fire for string/float columns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- .readthedocs.yml: replaced by GitHub Pages / MkDocs
- MANIFEST.in: unnecessary with setuptools>=64 + pyproject.toml
- dev-requirements.txt: replaced by pyproject.toml + CI
- test/test.bat, test/test.sh: referenced flake8 + sphinx, both removed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- E722: bare except → except StopIteration
- E501: wrap long StringDType/read_csv lines
- E712: == True/False → truthiness checks
- F541: remove extraneous f-string prefixes
- E401: split multi-imports

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cast const npy_int64** to void* before free() to avoid
"different 'const' qualifiers" warning on MSVC.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use cast() for the object-dtype branch in left join null-fill,
matching the pattern used elsewhere in the codebase. Add type: ignore
for StringDType(na_object=) which numpy stubs don't support yet.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add type: ignore[assignment] on None assignments to ndarray (needed
  by Python 3.12 numpy stubs)
- Keep type: ignore[call-arg] on StringDType(na_object=) (needed by
  Python 3.10 numpy stubs)
- Disable warn_unused_ignores: numpy stubs differ across versions,
  making per-line ignores impossible to satisfy universally

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

This PR aims to improve dtype interoperability across joins/unions (notably StringDType vs NumPy <U...) and to improve left-join null filling semantics for string/float columns, along with a version bump to 2.2.1.

Changes:

  • Relax join/union dtype validation to compare base types via _reduce_dtype rather than exact NumPy dtypes.
  • Adjust left-join null fill behavior to preserve string dtypes (StringDType(na_object=None)) and use NaN for floats, with warnings + object fallback for int/bool.
  • Rework update_dtypes_inplace to preserve raw dtypes for casting (and bump version/changelog/tests accordingly).

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tafra/group.py Base-type dtype validation for joins/unions; new left-join null-fill behavior + warnings.
tafra/base.py update_dtypes_inplace now keeps “raw” target dtypes for casting instead of round-tripping through formatted labels.
test/test_tafra.py Updates many StringDType array constructions and adds tests for mixed string dtypes, update_dtypes_inplace, and left-join warnings/nulls.
docs/changelog.md Adds 2.2.1 release notes describing dtype interop changes.
pyproject.toml Version bump to 2.2.1; mypy config tweak (warn_unused_ignores = false).
recipe/meta.yaml Version bump to 2.2.1 for packaging recipe.
tafra/_accel.c Minor free-cast tweak for col_data.
test/bench_vs_pandas_vs_polars.py Minor print formatting change.
test/bench_tafra.py Minor timer label formatting change.
test/test.sh Removed legacy test runner script.
test/test.bat Removed legacy test runner script.
dev-requirements.txt Removed legacy dev requirements file.
MANIFEST.in Removed manifest file.
.readthedocs.yml Removed Read the Docs config.


Comment on lines +10 to +12
instead of round-tripping through formatted string labels -- fixes silent
no-op when converting between `<U` and `StringDType`


Copilot AI Mar 25, 2026


The changelog claims update_dtypes_inplace fixes a silent no-op when converting between <U and StringDType, but the current implementation still skips casting for any 'str' → 'str' conversion. Either adjust the implementation to actually perform the conversion, or update this changelog entry to reflect the current behavior (e.g., only metadata updates / conversion intentionally skipped).

Suggested change
instead of round-tripping through formatted string labels -- fixes silent
no-op when converting between `<U` and `StringDType`
instead of round-tripping through formatted string labels; note that
`'str'` → `'str'` conversions (including `<U` → `StringDType`) still skip data
casting and only update dtype metadata

Comment on lines +231 to +234
yield {'y': np.array(
['one', 'two', 'one', 'two', 'one', 'two'],
dtype=np.dtypes.StringDType(),
)}

Copilot AI Mar 25, 2026


These continuation lines are misindented relative to the surrounding code, which hurts readability and makes future diffs noisier. Align the list/dtype arguments under the opening np.array( (as done in the iterator() block just above).

Comment on lines +246 to +249
yield ['y', np.array(
['one', 'two', 'one', 'two', 'one', 'two'],
dtype=np.dtypes.StringDType(),
)]

Copilot AI Mar 25, 2026


Same indentation issue here: the arguments to np.array( are not indented under the opening paren, unlike the rest of the file. Re-indent to match the prevailing style used elsewhere in this test module.

tafra/base.py Outdated
Comment on lines +1782 to +1785
# Skip if both are string types — StringDType and <U are
# compatible and casting between them can fail or lose data.
if (self._reduce_dtype(current_dtype) == 'str'
and self._reduce_dtype(target_dtype) == 'str'):

Copilot AI Mar 25, 2026


update_dtypes_inplace now unconditionally skips casting when both current and target dtypes reduce to 'str'. That means explicit conversions like <U10 → StringDType(na_object=None) (or StringDType → <U/str) will still be silently ignored, which contradicts the PR description/changelog about fixing the no-op for <U → StringDType. Consider either performing the requested cast when the concrete dtypes differ, or emitting a warning/raising when skipping so callers don't think the conversion happened.

Suggested change
# Skip if both are string types — StringDType and <U are
# compatible and casting between them can fail or lose data.
if (self._reduce_dtype(current_dtype) == 'str'
and self._reduce_dtype(target_dtype) == 'str'):
# Only skip when both are already the same concrete string dtype.
# This avoids unnecessary casts but still allows explicit
# conversions between different string dtypes (e.g. <U ↔ StringDType).
if (self._reduce_dtype(current_dtype) == 'str'
and self._reduce_dtype(target_dtype) == 'str'
and current_dtype == target_dtype):

tafra/group.py Outdated
Comment on lines +1246 to +1277
@@ -1253,9 +1256,29 @@ def apply(self, left_t: 'Tafra', right_t: 'Tafra') -> 'Tafra':
else:
# right column: fill matched rows, None for unmatched
if has_null:
out = np.empty(len(li), dtype=object)
out[matched] = right_t._data[c][ri[matched]]
out[~matched] = None
col_kind = right_t._data[c].dtype.kind
if col_kind in ('T', 'U', 'S'):
# String types: use StringDType(na_object=None)
out = np.empty(len(li), dtype=np.dtypes.StringDType(na_object=None)) # type: ignore[call-arg]
out[matched] = right_t._data[c][ri[matched]]
out[~matched] = None # type: ignore[assignment]
dtypes[c] = 'str'
elif col_kind == 'f':
# Float types: use NaN for missing
out = np.full(len(li), np.nan, dtype=right_t._data[c].dtype)
out[matched] = right_t._data[c][ri[matched]]
else:
# int, bool, etc.: fall back to object
warnings.warn(
f"Left join: column '{c}' (dtype {right_t._data[c].dtype}) "
f"has unmatched rows and no native null representation. "
f"Dtype has been cast to object. "
f"Use .astype(float) if NaN semantics are needed.",
stacklevel=4,

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left-join null fill treats any non-('T','U','S','f') dtype as having no native nulls and falls back to object + warning. This incorrectly includes datetime64/timedelta64 (kinds 'M'/'m'), which do have a native missing value (NaT) and should be preserved similarly to floats (fill with NaT instead of casting to object/warning).
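
The dtype-matched NaT approach can be checked directly; this is only a sketch of the reviewer's point, not tafra's code:

```python
import numpy as np

# The same pattern works for both datetime64 and timedelta64 columns when the
# NaT scalar is built from the column's own dtype:
for dt in (np.dtype('datetime64[D]'), np.dtype('timedelta64[s]')):
    out = np.empty(3, dtype=dt)
    out[:] = np.array('NaT', dtype=dt)   # NaT matching the column dtype
    assert np.isnat(out).all()
```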

Comment on lines +1311 to +1321
col_kind = right_t._data[column].dtype.kind
if col_kind not in ('T', 'U', 'S', 'f') and dtypes[column] != 'object':
warnings.warn(
f"Left join: column '{column}' "
f"(dtype {right_t._data[column].dtype}) "
f"has unmatched rows and no native null "
f"representation. Dtype has been cast to "
f"object. Use .astype(float) if NaN "
f"semantics are needed.",
stacklevel=2,
)

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the non-vectorized left-join path, the fallback-to-object warning logic also treats datetime64/timedelta64 columns (kinds 'M'/'m') as having no native null representation. To avoid unnecessary dtype changes and warnings, handle these kinds by preserving the original dtype and using NaT for unmatched rows.

dsfulf and others added 2 commits March 24, 2026 21:38
Dtype validation:
- Join/Union validate on _dtypes metadata (user intent), not raw array
  dtypes. Since _format_dtype collapses all string variants to 'str',
  StringDType vs <U no longer rejects.
- Restore metadata consistency check in joins (was removed, now back)

update_dtypes_inplace:
- 'str' label maps to StringDType(na_object=None) for nullable strings
- _from_init flag skips this during construction to preserve original dtype

Left join null-fill:
- String columns: StringDType(na_object=None) with None
- Float columns: original dtype with NaN
- Datetime/timedelta: original dtype with NaT
- Int/bool/bytes: object with None + warning
- Byte strings (kind='S') excluded from string path
- has_null flag replaces O(n) `None in v` scan
- stacklevel=3 on both C-path and fallback warnings

Test fixes:
- Restore TypeError expectation for corrupted _dtypes in joins
- Add datetime NaT null-fill test
- Fix test indentation (DictIterable, SequenceIterable)

Docs:
- CLAUDE.md: document dtype intent principle, update build/architecture
- README.md: add dtype metadata and left join null handling sections
- Changelog: updated to reflect all behavioral changes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
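
The per-dtype null-fill policy listed above can be sketched as a standalone function; `null_filled` is hypothetical and only mirrors the per-dtype-kind dispatch (the real logic lives in tafra/group.py), and the string branch assumes NumPy >= 2.0:

```python
import warnings
import numpy as np

def null_filled(col: np.ndarray, n: int) -> np.ndarray:
    kind = col.dtype.kind
    if kind in ('U', 'T'):                      # strings: nullable StringDType
        out = np.empty(n, dtype=np.dtypes.StringDType(na_object=None))
        out[:] = None
    elif kind == 'f':                           # floats: NaN, dtype preserved
        out = np.full(n, np.nan, dtype=col.dtype)
    elif kind in ('M', 'm'):                    # datetime/timedelta: NaT
        out = np.empty(n, dtype=col.dtype)
        out[:] = np.array('NaT', dtype=col.dtype)
    else:                                       # int/bool/bytes: object + warn
        warnings.warn(f"dtype {col.dtype} cast to object")
        out = np.full(n, None, dtype=object)
    return out

assert np.isnan(null_filled(np.array([1.5]), 2)).all()
assert np.isnat(null_filled(np.array(['2024-01-01'], dtype='datetime64[D]'), 2)).all()
```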
30 tests in TestStringDtypeInterop covering all v2.2.1 bugfixes:
- StringDType vs <U joins (both directions)
- <U8 vs <U12 width mismatch joins
- Union with mixed string dtypes
- update_dtypes_inplace 'str' → StringDType(na_object=None)
- Construction preserves original dtype
- Left join null-fill for every dtype kind:
  string, float, float32, datetime64, timedelta64, int, bool, bytes
- Object fallback warning (fires for int/bool/bytes, not str/float/dt)
- Corrupted _dtypes metadata rejected in joins
- Matching metadata accepted despite different array representations
- Join key dtype combinations: int64, float64, StringDType, <U
- Cross-type join rejection: int vs float, int vs string
- Concat mixed string dtypes
- Full-match left join preserves original dtypes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.



tafra/group.py Outdated
# datetime64/timedelta64: use NaT for missing
out = np.empty(len(li), dtype=right_t._data[c].dtype)
out[matched] = right_t._data[c][ri[matched]]
out[~matched] = np.datetime64('NaT')

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For datetime64/timedelta64 columns, missing values are filled with np.datetime64('NaT') regardless of whether the target dtype is datetime64[...] or timedelta64[...]. Using a datetime64 scalar for timedelta64 relies on cross-dtype casting and can be fragile across NumPy versions/units. Prefer constructing a NaT scalar matching the column dtype (e.g., via np.array('NaT', dtype=right_t._data[c].dtype).item()) and assign that.

Suggested change
out[~matched] = np.datetime64('NaT')
nat_value = np.array('NaT', dtype=right_t._data[c].dtype).item()
out[~matched] = nat_value

tafra/base.py Outdated
Comment on lines 1761 to 1762
tafra: Tafra | None
The updated `Tafra`.

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring's Returns section says tafra: Tafra | None ("The updated Tafra"), but update_dtypes_inplace is an in-place mutator that returns None. Please update the docstring to match the actual return value to avoid confusing callers.

Suggested change
tafra: Tafra | None
The updated `Tafra`.
None
This method mutates the `Tafra` in place.

Comment on lines +47 to +49
**`Tafra`** (`tafra/base.py`) — The dataframe class. Wraps `dict[str, np.ndarray]` where every column must share the same row count. Tracks dtypes separately in `_dtypes` — this metadata represents the user's declared intent for column types and controls dtype validation in joins, unions, and dtype updates. Implements dict-like access (keys, values, items, get, update). Decorated with `@dataclass`. String columns use numpy `StringDType(na_object=None)` to support `None` values; `_dtypes` stores `'str'` for all string variants (`StringDType`, `<U`).

**Dtype design principle:** `_dtypes` metadata is the user's intent. Join/union validation compares `_dtypes` labels, not raw numpy dtypes. `update_dtypes_inplace` is how users change their intent — it updates the label AND casts the array. The `'str'` label maps to `StringDType(na_object=None)`. If assigning directly to `_data`, you must call `_coalesce_dtypes()` afterwards.

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This claims Tafra string columns generally use StringDType(na_object=None), but current construction/assignment paths still create plain np.dtypes.StringDType() (e.g., _ensure_valid uses StringDType() and ObjectFormatter/CSVReader also use StringDType()). Either update those code paths to consistently use na_object=None, or clarify here that na_object=None is used only for specific cases (e.g. left-join null fill / update_dtypes_inplace({'x':'str'})).

CLAUDE.md Outdated
Comment on lines +70 to +71
- `formatter.py` — `ObjectFormatter` for custom dtype parsing (e.g., Decimal → float); auto-converts object arrays of strings to `StringDType(na_object=None)`
- `csvreader.py` — CSV reader with type inference; string columns produce `StringDType(na_object=None)`

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These bullets say ObjectFormatter and CSVReader produce StringDType(na_object=None), but the current implementations still cast to np.dtypes.StringDType() without na_object=None. Please align the docs with the code (or update tafra/formatter.py and tafra/csvreader.py if the new dtype is intended everywhere).

Suggested change
- `formatter.py` — `ObjectFormatter` for custom dtype parsing (e.g., Decimal → float); auto-converts object arrays of strings to `StringDType(na_object=None)`
- `csvreader.py` — CSV reader with type inference; string columns produce `StringDType(na_object=None)`
- `formatter.py` — `ObjectFormatter` for custom dtype parsing (e.g., Decimal → float); auto-converts object arrays of strings to `np.dtypes.StringDType()`
- `csvreader.py` — CSV reader with type inference; string columns produce `np.dtypes.StringDType()`

dsfulf and others added 2 commits March 24, 2026 22:05
- base.py _ensure_valid: string scalar creates nullable StringDType
- base.py describe(): stat column uses nullable StringDType
- base.py update_dtypes_inplace: fix docstring return type (None, not Tafra)
- formatter.py: object arrays of strings → StringDType(na_object=None)
- csvreader.py: CSV string columns → StringDType(na_object=None)
- group.py: dtype-matched NaT for datetime/timedelta null-fill
- CLAUDE.md: docs now accurately reflect na_object=None usage

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.



pyproject.toml Outdated
Comment on lines +96 to +98
# Disabled: numpy stubs differ across Python versions, causing
# type: ignore comments to be needed in some envs but not others.
warn_unused_ignores = false

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warn_unused_ignores is disabled globally, which reduces mypy’s ability to catch stale/accidental # type: ignore usage across the entire codebase. Consider keeping it enabled globally and instead using a narrower override (e.g., per-module/file where numpy-stubs variance forces conditional ignores), so real unused ignores elsewhere are still surfaced.

Suggested change
# Disabled: numpy stubs differ across Python versions, causing
# type: ignore comments to be needed in some envs but not others.
warn_unused_ignores = false
# Note: numpy stubs can differ across Python versions, which may cause
# type: ignore comments to be needed in some envs but not others.
warn_unused_ignores = true
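
A narrower override of the kind suggested here (and adopted in a later commit on this PR) could look like the following pyproject.toml sketch; the module list is taken from that commit:

```toml
[tool.mypy]
warn_unused_ignores = true   # keep the global check so stale ignores surface

# numpy stub variance across Python versions forces conditional ignores only
# in the modules that touch StringDType, so relax the check just there:
[[tool.mypy.overrides]]
module = ["tafra.base", "tafra.group", "tafra.formatter", "tafra.csvreader"]
warn_unused_ignores = false
```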

dsfulf and others added 2 commits March 24, 2026 22:23
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…numpy stubs

Keep warn_unused_ignores=true globally so stale type: ignore comments
are caught. Add per-module override for tafra.base, tafra.group,
tafra.formatter, tafra.csvreader where numpy stub variance across
Python versions requires conditional ignores.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

tafra/group.py:352

  • Union._validate zips left._data.items() with left._dtypes.items(), which can pair the wrong dtype with a column if dict insertion orders differ (e.g., user passes dtypes in a different key order than data). This can lead to incorrect dtype validation and misleading error messages. Iterate by column name (e.g., for col, dtype in left._dtypes.items(): value = left._data[col] ...) instead of zipping two dicts.
        for (data_column, value), (dtype_column, dtype) \
                in zip(left._data.items(), left._dtypes.items()):

            if data_column not in right._data or dtype_column not in right._dtypes:
                raise TypeError(
                    f'This `Tafra` column `{data_column}` does not exist in right `Tafra`.')

            # Compare user-declared dtypes (metadata = intent).
            # _format_dtype collapses string variants to 'str'.
            elif dtype != right._dtypes[dtype_column]:
                raise TypeError(
                    f'This `Tafra` column `{data_column}` dtype `{dtype}` '
                    f'does not match right `Tafra` dtype `{right._dtypes[dtype_column]}`.')
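
The mispairing the reviewer describes is easy to demonstrate with plain dicts (illustrative only, no tafra needed):

```python
# Zipping two dicts pairs entries by insertion order, which mispairs columns
# when the key orders differ:
data = {'a': 1, 'b': 2}
dtypes = {'b': 'int', 'a': 'int'}       # same keys, different insertion order
pairs = list(zip(data.items(), dtypes.items()))
assert pairs[0] == (('a', 1), ('b', 'int'))   # 'a' paired with 'b' dtype

# Iterating by column name avoids the mispairing:
safe = [(col, data[col], dt) for col, dt in dtypes.items()]
assert ('a', 1, 'int') in safe and ('b', 2, 'int') in safe
```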


Comment on lines +1179 to +1185
def test_mixed_string_dtypes() -> None:
"""StringDType vs <U should interop in joins, unions, and concat."""
left = Tafra({
'key': np.array(['a', 'b', 'c'], dtype=np.dtypes.StringDType()),
'val': np.array([1, 2, 3]),
})
right = Tafra({

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new test_mixed_string_dtypes covers many of the same scenarios that are then repeated (often more granularly) in TestStringDtypeInterop below (joins, union, concat, left-join null handling). This duplication significantly increases test size/runtime and maintenance burden; consider keeping one canonical suite (either the single function or the class) and removing/reducing overlap.

Comment on lines +1555 to +1586

def test_left_join_warns_on_object_fallback(self) -> None:
import warnings
left = Tafra({'k': np.array([1, 2, 3]), 'v': np.array([10, 20, 30])})
right = Tafra({
'k': np.array([2, 3]),
'count': np.array([100, 200]),
})
with warnings.catch_warnings(record=True) as w:
warnings.simplefilter('always')
left.left_join(right, on=[('k', 'k', '==')])
msgs = [x for x in w if 'cast to object' in str(x.message)]
assert len(msgs) == 1
assert "'count'" in str(msgs[0].message)

def test_left_join_no_warning_for_native_nulls(self) -> None:
import warnings
left = Tafra({'k': np.array([1, 2, 3]), 'v': np.array([10, 20, 30])})
right = Tafra({
'k': np.array([2, 3]),
'name': np.array(['b', 'c'],
dtype=np.dtypes.StringDType()),
'score': np.array([1.5, 2.5]),
'ts': np.array(['2024-01-01', '2024-06-15'],
dtype='datetime64[D]'),
})
with warnings.catch_warnings(record=True) as w:
warnings.simplefilter('always')
left.left_join(right, on=[('k', 'k', '==')])
msgs = [x for x in w if 'cast to object' in str(x.message)]
assert len(msgs) == 0


Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The left-join object fallback warning is tested twice (test_left_join_object_fallback_warning and TestStringDtypeInterop.test_left_join_warns_on_object_fallback), along with similar “no warning” coverage. Consider deduplicating to a single test location to avoid divergence over time (message matching, stacklevel expectations, etc.).

Suggested change
def test_left_join_warns_on_object_fallback(self) -> None:
import warnings
left = Tafra({'k': np.array([1, 2, 3]), 'v': np.array([10, 20, 30])})
right = Tafra({
'k': np.array([2, 3]),
'count': np.array([100, 200]),
})
with warnings.catch_warnings(record=True) as w:
warnings.simplefilter('always')
left.left_join(right, on=[('k', 'k', '==')])
msgs = [x for x in w if 'cast to object' in str(x.message)]
assert len(msgs) == 1
assert "'count'" in str(msgs[0].message)
def test_left_join_no_warning_for_native_nulls(self) -> None:
import warnings
left = Tafra({'k': np.array([1, 2, 3]), 'v': np.array([10, 20, 30])})
right = Tafra({
'k': np.array([2, 3]),
'name': np.array(['b', 'c'],
dtype=np.dtypes.StringDType()),
'score': np.array([1.5, 2.5]),
'ts': np.array(['2024-01-01', '2024-06-15'],
dtype='datetime64[D]'),
})
with warnings.catch_warnings(record=True) as w:
warnings.simplefilter('always')
left.left_join(right, on=[('k', 'k', '==')])
msgs = [x for x in w if 'cast to object' in str(x.message)]
assert len(msgs) == 0
# Covered in dedicated string-dtype interop tests (see TestStringDtypeInterop).

@dsfulf dsfulf merged commit 6cda733 into main Mar 25, 2026
11 checks passed
