Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: long str unconvert fix #6166

Closed
wants to merge 1 commit into from

Conversation

wabu
Copy link
Contributor

@wabu wabu commented Jan 29, 2014

When storing long strings in hdf5, they get truncated at 64bytes when converted back to strings. The fix computes the itemsize of the strings and uses it before converting to strings. It would be possible to use the attribute information of the table, but that would require changes in the calling code.

Test that failed without the patch is included.

@jreback
Copy link
Contributor

jreback commented Jan 29, 2014

this was a fixed bug in numpy 1.7.2
numpy/numpy#2485

@jreback
Copy link
Contributor

jreback commented Jan 29, 2014

needs a perf check

iirc this was an odd bottleneck

course it only matters if the data is encoded

@wabu
Copy link
Contributor Author

wabu commented Jan 29, 2014

I did an update on numpy and it fixed the issue. If you think the I should pref check the patch for numpy <= 1.7.1, I could do this, otherwise you can close the pr

@jreback
Copy link
Contributor

jreback commented Jan 29, 2014

no...I like this PR!

I don't think its actually an issue (the issue is in decoding strings, which is why I used this method in the first place which is quite fast). True you have to figure out the max length of the string...

tell you what...why don't you put a conditional based on numpy <= 1.7.1 and then do your way, otherwise leave it? (use LooseVersion)

@wabu
Copy link
Contributor Author

wabu commented Feb 6, 2014

I did the perf test, nothing changed significantly, and also implemented the check for the numpy version

@@ -4157,8 +4158,8 @@ def _convert_string_array(data, encoding, itemsize=None):
data = np.array(data, dtype="S%d" % itemsize)
return data


def _unconvert_string_array(data, nan_rep=None, encoding=None):
_numpy_needs_str_fix = LooseVersion(np.__version__) < '1.7.2'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name this like: _np_version_under_172 and put near the top of the file

@wabu
Copy link
Contributor Author

wabu commented Feb 9, 2014

Did the changes and added release notes.

@jreback
Copy link
Contributor

jreback commented Feb 9, 2014

This change is actually very tricky and subtle because of exactly 3.3 / 2.7 interact with encodings....

still failing on 3.2 on 1.7.1.....

jreback@6c268de

I am not sure how to specify a dtype on 3.2 like 'S100'....maybe you can figure this out?
its essentially a unicode dtype

@wabu
Copy link
Contributor Author

wabu commented Feb 10, 2014

what about np.dtype((str,itemsize))? Results in '<U{size}' for python 3 and '|S{size}' for python 2, which seems to be ok?

@jreback
Copy link
Contributor

jreback commented Feb 10, 2014

do u have the ability to try in py3.3 on windows
that was the one throwing the error originally (weird but not on Linux)

u can use the just released 3.1 and try modifying the source
as compiling in windows is a bit tricky

if u want to try compiling on windows I can help though

@wabu
Copy link
Contributor Author

wabu commented Feb 10, 2014

can't help with this, have no windows machine around.

@jreback
Copy link
Contributor

jreback commented Feb 10, 2014

ok no problem
pls update with the encoding tests and your change and I'll give a whirl in a couple of days

@jreback
Copy link
Contributor

jreback commented Feb 13, 2014

I did a PR on your branch with some fixes to make this work on windows: wabu#1

@jreback
Copy link
Contributor

jreback commented Feb 16, 2014

@wabu pls incorporate my changes when u can

prior to numpy 1.7.2 long strings got truncated when read back from a
hdf5 file
@jreback
Copy link
Contributor

jreback commented Mar 9, 2014

@wabu can you incoporate the changes and rebase?

@jreback
Copy link
Contributor

jreback commented Apr 6, 2014

closing in favor of #6821

@jreback jreback closed this Apr 6, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO HDF5 read_hdf, HDFStore
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants