Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: ensure numpy fixed-size binary/string arrays are converted to object dtype for Index creation #57995

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

PF2100
Copy link

@PF2100 PF2100 commented Mar 25, 2024

  • closes BUG: Cannot use numpy FLS as indicies since pandas 2.2.1 #57645
  • [Tests added and passed] if fixing a bug or adding a new feature
  • All [code checks passed].
  • Added [type annotations] to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/v3.0.0.rst file if fixing a bug or adding a new feature.

While using the function set_index with parameter inplace=True, the function would try and create a new index where its dtype would be a FLS S{value} dtype, which was not recognized by the function _dtype_to_subclass and raised a NotImplementedError. That said , by adding a verification that recognizes FLS dtype , the index is created successfully and the function executes properly.

….2.1

While using the function set_index with parameter inplace=True, the function would try and create a new index where its dtype would be a FLS S{value} dtype, which was not recognized by the function _dtype_to_subclass and raised a NotImplementedError. That said , by adding a verification that recognizes FLS dtype , the index is created successfully and the function executes properly.
@PF2100 PF2100 force-pushed the 57645-cannot-detect-FLSdtype-in-base-_dtype_to_subclass branch from 39fcc6a to 4b27a30 Compare March 25, 2024 23:24
@PF2100 PF2100 marked this pull request as ready for review March 26, 2024 00:17
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Apr 26, 2024
@jorisvandenbossche jorisvandenbossche changed the title FIX #57645: Cannot use numpy FLS as indicies since pandas 2.2.1 REGR: ensure numpy fixed-size binary/string arrays are converted to object dtype for Index creation Apr 29, 2024
@jorisvandenbossche jorisvandenbossche added Dtype Conversions Unexpected or buggy dtype conversions Regression Functionality that used to work in a prior pandas version Index Related to the Index class or subclasses and removed Stale labels Apr 29, 2024
@jorisvandenbossche jorisvandenbossche added this to the 2.2.3 milestone Apr 29, 2024
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for working on this, and apologies for the lack of reviews!

The original PR that introduced this change edited the array_tuplesafe function (#57139), which is called inside Index.__new__ before calling _dtype_to_subclass. So if we would change array_tuplesafe to still return object dtype instead of numpy S/U, that would fix it before getting to _dtype_to_subclass.

The current fix also essentially allows to create an Index with an underlying numpy array with S6 dtype, while I think we want to restore the behaviour of converting to object dtype (I think my suggestion above should do that)

Other comment: could you also add a test for direct Index constructor from a Series object (i.e. pd.Index(pd.Series(arr))), because that is affected as well, in addition to set_index

Comment on lines +621 to +626
string_length = 6
in_dtype, df_name = f"S{string_length}", "fruit"
data = ["apple", "banana", "orange", "grape"]

# Create array with FLS(|S{value}) dtype
arr = np.array(data, dtype=in_dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the purpose of the tests, we can simplify the snippet from the issue report a bit

Suggested change
string_length = 6
in_dtype, df_name = f"S{string_length}", "fruit"
data = ["apple", "banana", "orange", "grape"]
# Create array with FLS(|S{value}) dtype
arr = np.array(data, dtype=in_dtype)
# Create array with FLS(|S{value}) dtype
arr = np.array( ["apple", "banana", "orange", "grape"], dtype="S6")

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for the sugestion and I will change the test. Regarding the asarray_tuplesafe, I went into a little more detail in this PR's issue (#57645) but looking back into the original PR, the line elif isinstance(values, ABCSeries) was added which allows for any Series to keep their original types and values without performing any check if their type is compatible or not.

Could a possible solution be something of the sort:

elif isinstance(values, ABCSeries) and values.dtype.kind != "S":
        return values_values

By refactoring this line, the elif will fail if the values have dtype ="S" and creates a new array with dtype object as it used to. Before I proceed, could I know your thoughts on the suggestion I've made?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Index Related to the Index class or subclasses Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Cannot use numpy FLS as indicies since pandas 2.2.1
2 participants