
Support for numpy strings #5261

Closed
cancan101 opened this issue Oct 19, 2013 · 10 comments
Labels
Dtype Conversions (unexpected or buggy dtype conversions) · Enhancement · Performance (memory or execution speed) · Strings (string extension data type and string data)

Comments

@cancan101
Contributor

This is more of a question than an issue:

why does Pandas use the object dtype rather than numpy's native fixed-length strings?

For example:

In [24]: np.array(["foo", "baz"])
Out[24]: 
array(['foo', 'baz'], 
      dtype='|S3')

but:

In [29]: pd.DataFrame({"a":np.array(["foo", "baz"])}).dtypes
Out[29]: 
a    object
dtype: object
@jreback
Contributor

jreback commented Oct 19, 2013

well, it's a lot simpler

fixed-width arrays are nice, but when changing elements you have to be very careful to avoid truncation

you would also have another type to support, as you always need object type (it's the top of the hierarchy)

most cython routines already accept object; these would need conversion both to and from

in my experience you need special handling when dealing with Unicode (so you would need another type, or just make these object)

so it could be done, but it's a big project
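A minimal sketch of the truncation pitfall mentioned above: numpy infers a fixed width from the initial data, and later assignments that exceed it are silently cut to fit.

```python
import numpy as np

# numpy infers dtype '|S3' from the longest initial element, so any
# later assignment longer than 3 bytes is silently truncated.
arr = np.array([b"foo", b"baz"])
print(arr.dtype)   # |S3

arr[0] = b"foobar"
print(arr[0])      # b'foo' -- no error, the tail is simply dropped
```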

@cancan101
Contributor Author

That makes sense. I would be curious to know what the advantages of native string support would be.

@jreback
Contributor

jreback commented Oct 19, 2013

in theory fixed array sizes should be much faster,
so if you're doing lots of string stuff it might be worth it

however, another big issue is the lack of numpy NaN support,
which using object type fixes (and Series.str handles properly)
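The NaN point can be seen directly: a fixed-width string dtype can only coerce `np.nan` into the literal string `'nan'`, while an object array keeps the real float NaN that pandas treats as missing.

```python
import numpy as np

# In a fixed-width unicode array, np.nan is stringified to 'nan';
# there is no dedicated missing-value marker.
fixed = np.array(["foo", np.nan], dtype="U3")
print(fixed[1])          # nan -- but it's an ordinary 3-character string

# In an object array the float NaN survives, and NaN != NaN.
obj = np.array(["foo", np.nan], dtype=object)
print(obj[1] != obj[1])  # True -- a genuine float NaN
```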

@jreback
Contributor

jreback commented Jun 17, 2015

see comments on #10351. In light of categoricals (and lack of nan support) this is a non-starter for general strings.

@jreback jreback closed this as completed Jun 17, 2015
@jreback jreback reopened this Jun 17, 2015
@jreback
Contributor

jreback commented Jun 17, 2015

But will leave this open as a specialized dtype option for a certain subset of analysis (e.g. it's probably useful to the bio guys)

@frol

frol commented Mar 26, 2017

It is unfortunate that Pandas cannot just do what I asked it to do. I know that I should exercise care when using fixed-width strings, but they are much faster than Python objects.
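A rough sketch of the footprint difference being pointed at here: `'S3'` packs three bytes per element inline, while an object array stores a pointer per element plus a separate heap-allocated Python object for each value.

```python
import sys
import numpy as np

fixed = np.array([b"foo", b"bar", b"baz"])   # dtype='|S3', 3 bytes per element
boxed = fixed.astype(object)                 # one pointer + one bytes object each

print(fixed.nbytes)                          # 9  (3 elements * 3 bytes)
print(boxed.nbytes)                          # pointer storage only (8 bytes each on 64-bit)
print(sys.getsizeof(boxed[0]))               # plus a full Python object per value
```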

@kawing-chiu

kawing-chiu commented Dec 14, 2018

Another drawback of not having fixed-length numpy strings: when using pyarrow, the presence of columns with an 'object' dtype prevents pyarrow from (de)serializing in a zero-copy fashion.

As the user of the data, I can often determine the maximum length of some text columns. The string routines can happily truncate my strings to the specified length (the same goes for assignment).
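The "truncate to a known maximum length" behavior asked for above is essentially what a plain numpy cast already does; a sketch, assuming a 5-character cap:

```python
import numpy as np

# Casting an object column to a fixed-width dtype truncates anything
# longer than the declared width -- here 5 characters.
col = np.array(["short", "a much longer value"], dtype=object)
fixed = col.astype("U5")
print(fixed[1])   # a muc
```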

@TomAugspurger
Contributor

TomAugspurger commented Dec 14, 2018

I suspect that proper-string support will land sometime soon. Perhaps next year sometime.

In the meantime, you could write a simple extension array to convince pandas not to coerce your fixed-width ndarray of strings to object.

You shouldn't take this direct approach, since it uses private methods, but this is the basic idea:

In [39]: class NumPyString(pd.core.arrays.PandasArray):
    ...:     _typ = 'extensionarray'

In [40]: arr = np.array(['a', 'b', 'c'])

In [41]: s = pd.Series(NumPyString(arr))

In [42]: s
Out[42]:
0    a
1    b
2    c
dtype: str32

In [46]: s.array._ndarray
Out[46]: array(['a', 'b', 'c'], dtype='<U1')

Things like string methods won't work with this approach, either.

@rhshadrach
Member

With StringDType, I think this issue can be closed.

@TomAugspurger
Contributor

Agreed, thanks. xref #35169 for those interested in following along.
