reading of old pandas dataframe (created in python 2) failed with 0.23.4 #24925

Closed
rbenes opened this issue Jan 25, 2019 · 6 comments

@rbenes
Contributor

commented Jan 25, 2019

Hi,

First, I have to apologize that my description will be very vague.

I have a problem with one of my dataframes, which was created earlier with Python 2 and an older version of pandas (unfortunately I do not know which version). I cannot open it in Python 3 with pandas 0.23.4, although loading it in Python 3 with pandas 0.22.0 works fine.

For reading, I am using:

hdf = pd.HDFStore(src_filename, mode="r")
data_frame = hdf.select(src_tablename)

My stack trace in pandas 0.23.4 is:

Traceback (most recent call last):
    data_frame = hdf.select(src_tablename)
  File "/home/rbenes/virtual_envs/iface_venv36_new_pkgs/lib/python3.6/site-packages/pandas/io/pytables.py", line 743, in select
    return it.get_result()
  File "/home/rbenes/virtual_envs/iface_venv36_new_pkgs/lib/python3.6/site-packages/pandas/io/pytables.py", line 1485, in get_result
    results = self.func(self.start, self.stop, where)
  File "/home/rbenes/virtual_envs/iface_venv36_new_pkgs/lib/python3.6/site-packages/pandas/io/pytables.py", line 734, in func
    columns=columns)
  File "/home/rbenes/virtual_envs/iface_venv36_new_pkgs/lib/python3.6/site-packages/pandas/io/pytables.py", line 4182, in read
    if not self.read_axes(where=where, **kwargs):
  File "/home/rbenes/virtual_envs/iface_venv36_new_pkgs/lib/python3.6/site-packages/pandas/io/pytables.py", line 3385, in read_axes
    errors=self.errors)
  File "/home/rbenes/virtual_envs/iface_venv36_new_pkgs/lib/python3.6/site-packages/pandas/io/pytables.py", line 2195, in convert
    self.data, nan_rep=nan_rep, encoding=encoding, errors=errors)
  File "/home/rbenes/virtual_envs/iface_venv36_new_pkgs/lib/python3.6/site-packages/pandas/io/pytables.py", line 4658, in _unconvert_string_array
    data = libwriters.string_array_replace_from_nan_rep(data, nan_rep)
  File "pandas/_libs/writers.pyx", line 158, in pandas._libs.writers.string_array_replace_from_nan_rep
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'double'

This stack trace led me to this pull request: #24510

If I list the file with a tool such as h5ls, it looks fine (it loads and the content looks correct).

Unfortunately, I cannot share the dataframe because it is private, and I can no longer reproduce how it was created with the older versions :-(. So I am not able to deliver the unreadable dataframe.

I debugged pandas and found that this patch helped me:

diff --git a/pandas/io/pytables.py b/pandas/io/pytables.py
index 4e103482f..2ab6ddb5b 100644
--- a/pandas/io/pytables.py
+++ b/pandas/io/pytables.py
@@ -3288,7 +3288,7 @@ class Table(Fixed):
         self.nan_rep = getattr(self.attrs, 'nan_rep', None)
         self.encoding = _ensure_encoding(
             getattr(self.attrs, 'encoding', None))
-        self.errors = getattr(self.attrs, 'errors', 'strict')
+        self.errors = _ensure_decoded(getattr(self.attrs, 'errors', 'strict'))
         self.levels = getattr(
             self.attrs, 'levels', None) or []
         self.index_axes = [
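
A minimal sketch of why this one-line change may matter, assuming the `errors` attribute written under Python 2 comes back as bytes when read under Python 3. `ensure_decoded_sketch` is a hypothetical stand-in written for illustration; it is not pandas' `_ensure_decoded`, and this snippet does not reproduce the exact code path in the traceback above.

```python
def ensure_decoded_sketch(value, encoding="utf-8"):
    """Hypothetical helper: return str for bytes input, pass anything else through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = "abc".encode("utf-8")

# Assumption: an attribute stored by Python 2 is read back as bytes in Python 3.
legacy_errors = b"strict"

try:
    raw.decode("utf-8", errors=legacy_errors)
    failed = False
except TypeError:
    # Python 3 requires the `errors` argument to be str, not bytes.
    failed = True

# Normalizing the attribute first lets the decode succeed.
decoded = raw.decode("utf-8", errors=ensure_decoded_sketch(legacy_errors))
print(failed, decoded)
```

If a failed decode is swallowed further up and replaced by NaN values, that could plausibly explain an object array turning into a float array, but that link is an assumption and not something verified here.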

Can anyone advise me whether such a fix is fine and, if so, whether I can send it as a pull request without a reproducer?

Thank you.

@jreback

Contributor

commented Jan 25, 2019

you should try with 0.24.0 which is releasing today and has that patch

@rbenes

Contributor Author

commented Jan 25, 2019

I know that pull request #24510 will be in 0.24, but the _ensure_decoded() call added by my patch is in a different place.

@jreback

Contributor

commented Jan 25, 2019

your diff looks the same

@rbenes

Contributor Author

commented Jan 25, 2019

I am not sure...
My patch is trying to change this - https://github.com/pandas-dev/pandas/blob/master/pandas/io/pytables.py#L3291

But the pull request mentioned changed this - https://github.com/pandas-dev/pandas/blob/master/pandas/io/pytables.py#L2524

Maybe there is some hierarchy that I do not see, but without my patch, master (which will probably be the base for 0.24?) still fails in my case (I know my case is specific).

@jreback

Contributor

commented Jan 25, 2019

well, this would require a test: construct a dummy file that fails and that the patch fixes, just like the referenced issue

@rbenes

Contributor Author

commented Jan 31, 2019

I found a reproducer.

Saving the dataframe:

import pandas as pd

df_orig = pd.DataFrame({
    "a": ["a", "b"],
    "b": [2, 3]
})

filename = "a.h5"

hdf = pd.HDFStore(filename, mode="w")
hdf.put("table", df_orig, format='table', data_columns=True, index=None)
hdf.close()

env:
Python 2.7.15
pandas 0.23.4
numpy 1.16.0

loading:

hdf = pd.HDFStore(filename, mode="r")
df_loaded = hdf.select("table")
hdf.close()
print("loaded")

print(df_loaded.equals(df_orig))

env:
Python 3.6.7
pandas 0.23.4
numpy 1.14.3

Traceback (most recent call last):
  File "pandas_test.py", line 19, in <module>
    df_loaded = hdf.select("table")
  File "/home/rbenes/virtual_envs/venv36_new_pkgs/lib/python3.6/site-packages/pandas/io/pytables.py", line 741, in select
    return it.get_result()
  File "/home/rbenes/virtual_envs/venv36_new_pkgs/lib/python3.6/site-packages/pandas/io/pytables.py", line 1483, in get_result
    results = self.func(self.start, self.stop, where)
  File "/home/rbenes/virtual_envs/venv36_new_pkgs/lib/python3.6/site-packages/pandas/io/pytables.py", line 734, in func
    columns=columns)
  File "/home/rbenes/virtual_envs/venv36_new_pkgs/lib/python3.6/site-packages/pandas/io/pytables.py", line 4180, in read
    if not self.read_axes(where=where, **kwargs):
  File "/home/rbenes/virtual_envs/venv36_new_pkgs/lib/python3.6/site-packages/pandas/io/pytables.py", line 3383, in read_axes
    errors=self.errors)
  File "/home/rbenes/virtual_envs/venv36_new_pkgs/lib/python3.6/site-packages/pandas/io/pytables.py", line 2193, in convert
    self.data, nan_rep=nan_rep, encoding=encoding, errors=errors)
  File "/home/rbenes/virtual_envs/venv36_new_pkgs/lib/python3.6/site-packages/pandas/io/pytables.py", line 4656, in _unconvert_string_array
    data = libwriters.string_array_replace_from_nan_rep(data, nan_rep)
  File "pandas/_libs/writers.pyx", line 158, in pandas._libs.writers.string_array_replace_from_nan_rep
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'double'

So I will prepare a pull request with a test that uses this dummy dataframe.
