Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: HDFStore.append with encoded string itemsize #11240

Closed
wants to merge 1 commit into from

Conversation

TomAugspurger
Copy link
Contributor

Closes #11234

Failure came when the maximum length of the unencoded string
was smaller than the maximum encoded length.

Need to run a perf check still. We end up having to call _convert_string_array twice, once before we know the min_itemsize, and a second time just before appending once we do know the min itemsize.

@TomAugspurger
Copy link
Contributor Author

So initially I had to call _convert_string_array twice, the second time just to change the length. But it seems like we can just astype the converted string to the new length, which is quite a bit faster.

In [1]: from pandas.io.pytables import _convert_string_array

In [2]: a = np.array(10000 * ['ë́'])

In [3]: c =  _convert_string_array(a, 'utf-8')

In [4]: %timeit _convert_string_array(a, 'utf-8')
100 loops, best of 3: 5.37 ms per loop

In [5]: %timeit c.astype('|S6')
10000 loops, best of 3: 91.4 µs per loop

All the tests passed, but I've never worked with numpy's fixed length strings. Hopefully I'm not breaking any rules here.

@jreback jreback added Bug Unicode Unicode strings IO HDF5 read_hdf, HDFStore labels Oct 5, 2015
@jreback jreback added this to the 0.17.1 milestone Oct 5, 2015
@@ -42,3 +42,10 @@ Performance Improvements

Bug Fixes
~~~~~~~~~


- Bug in HDFStore.append with strings whose encoded length exceded the max unencoded length (:issue:`11234`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use double backticks around HDFStore.append


def convert_string_data(self, data, itemsize, encoding):
return _convert_string_array(data, encoding, itemsize)
self.set_data(data_converted.astype('|S%d' % itemsize))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right that is ok (though maybe pass copy=False) (e.g. if we didn't change it don't want to copy again)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't know that was a kwarg to astype, cool.

I'll ping you when it's green.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep.....these 0.17.1 will have to wait till after I release 0.17.0 (on friday).....but getting them lined up

Failure came when the maximum length of the unencoded string
was smaller than the maximum encoded lenght.
@jreback
Copy link
Contributor

jreback commented Oct 9, 2015

merged via 26db172

thanks @TomAugspurger

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO HDF5 read_hdf, HDFStore Unicode Unicode strings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants