Skip to content

Conversation

ngoldbaum
Copy link
Member

@ngoldbaum ngoldbaum commented Dec 7, 2022

I wasn't accounting properly for null termination bytes in my original implementation. I now append a null byte to the result of getitem.

I've also added tests and fixed some bugs around making sure data are truncated properly at array creation time and casting time if the dtype is too narrow to contain the full input data.

@seberg
Copy link
Member

seberg commented Dec 8, 2022

Just to note, NumPy's string/bytes does not store that null terminator, there is an implicit null terminator behind things. Of course you have to be very careful to not copy it.

@mattip
Copy link
Member

mattip commented Dec 8, 2022

I think this is wrong: the dtype itemsize should match the name.

@ngoldbaum
Copy link
Member Author

Thanks for the feedback, I’m not a Numpy expert so I’m going to make mistakes like this and I really appreciate getting guidance towards the right approach.

When I took a look at the bytes type yesterday I thought there had to be null terminators in the array data but I guess I was wrong.

Should the dtype getitem add the null terminator when someone accesses a scalar?

@seberg
Copy link
Member

seberg commented Dec 8, 2022

Should the dtype getitem add the null terminator when someone accesses a scalar?

I would lean towards yes. Null termination should make some things easier and wasting one byte really doesn't matter for scalars.

# dtype = ASCIIDType()
# arr = np.array(["hello", "this", "is", "an", "array"], dtype=dtype)
# assert repr(arr) == ("array(['', '', '', '', ''], dtype=ASCIIDType(0))")
# assert arr.tobytes() == b""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are commented out because they depend on numpy/numpy#22763 being merged in numpy.

@ngoldbaum
Copy link
Member Author

Now it doesn’t store the null characters anymore and getitem will pad the result with null.

I also realized the contiguous casting loop isn’t terribly useful because almost all casts aren’t aligned, so I’ve made it so there’s only one casting loop.

@ngoldbaum
Copy link
Member Author

ngoldbaum commented Dec 8, 2022

because almost all casts aren’t aligned

In particular, I found that the check for whether a cast is aligned does npy_uint_alignment(dtype->elsize), which only returns true for dtypes that are 1, 4, 8, or 16 bytes wide.

@seberg
Copy link
Member

seberg commented Dec 9, 2022

Eeek, the copying code has additional alignment for the purpose of complex loops... Because we copy complex64 (two 32bit floats) via uint64 which has a lrger alignment.

None of that is ideal, but due to that the alignment passed in is larger than you would think.

@ngoldbaum ngoldbaum merged commit 3b0d427 into numpy:main Dec 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants