HDFStore: unable to create index, no error message #28156

Closed
adamjstewart opened this issue Aug 26, 2019 · 2 comments · Fixed by #34983
Labels
Bug Error Reporting Incorrect or improved errors from pandas IO HDF5 read_hdf, HDFStore
Comments

@adamjstewart
Contributor

I was trying to follow the documentation at https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#indexing but ran into an unintuitive bug with HDFStore index creation. I thought I would report it in case someone else runs across this problem.

First, I create 2 dataframes and an HDFStore:

>>> import pandas as pd
>>> import numpy as np
>>> df_1 = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))
>>> df_2 = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))
>>> st = pd.HDFStore('appends.h5', mode='w')

Now, if I append with data_columns specified:

>>> st.append('df', df_1, data_columns=['B'], index=False)
>>> st.append('df', df_2, data_columns=['B'], index=False)

I can successfully create an index:

>>> st.create_table_index('df', columns=['B'], optlevel=9, kind='full')
>>> st.get_storer('df').table
/df/table (Table(20,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  autoindex := True
  colindexes := {
    "B": Index(9, full, shuffle, zlib(1)).is_csi=True}

But if I instead leave out the data_columns:

>>> st.append('df', df_1, index=False)
>>> st.append('df', df_2, index=False)

no index is created:

>>> st.create_table_index('df', columns=['B'], optlevel=9, kind='full')
>>> st.get_storer('df').table
/df/table (Table(20,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1)}
  byteorder := 'little'
  chunkshape := (2730,)

This is unintuitive for 2 reasons:

  1. Why does HDFStore need to know the indexable columns during append and during create_table_index?
  2. Why doesn't create_table_index raise an error message when it isn't able to create an index?

I think fixing either 1 or 2 would make things much more intuitive.
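Until one of these is fixed, a possible workaround for a table that was already written without data_columns is to rewrite it. This is only a sketch: `rewrite_with_data_columns` is a hypothetical helper, it loads the whole table into memory, and it assumes the table fits there.

```python
import pandas as pd


def rewrite_with_data_columns(store: pd.HDFStore, key: str, data_columns: list) -> None:
    """Rewrite an existing table so that `data_columns` become indexable."""
    df = store.select(key)  # reads the entire table into memory
    store.remove(key)       # drops the old table node
    # Re-append with the columns declared as data columns, then index them.
    store.append(key, df, data_columns=data_columns, index=False)
    store.create_table_index(key, columns=data_columns, optlevel=9, kind="full")
```

With the store from the second example, this would be called as `rewrite_with_data_columns(st, 'df', ['B'])`.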

@jbrockmendel jbrockmendel added the IO HDF5 read_hdf, HDFStore label Oct 16, 2019
@mroeschke mroeschke added the Bug label May 16, 2020
@arw2019
Member

arw2019 commented Jun 24, 2020

I can reproduce this bug exactly as described above on pandas master.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : 526f404
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-106-generic
Version : #107-Ubuntu SMP Thu Jun 4 11:27:52 UTC 2020
machine : x86_64
processor :
byteorder : little
LC_ALL : C.UTF-8
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0.dev0+1940.g526f40431
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.3.1.post20200616
Cython : 0.29.20
pytest : 5.4.3
hypothesis : 5.18.0
sphinx : 3.1.1
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.1
html5lib : 1.1
pymysql : 0.9.3
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : 0.4.0
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.15.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.48.0

@arw2019
Member

arw2019 commented Jun 25, 2020

I think the answer is that you should not be allowed to read a column unless you specified it as a data_column when appending. See this doc:
https://pandas.pydata.org/pandas-docs/version/0.15.1/io.html

If everyone agrees with that, then we should raise an AttributeError when a user attempts this.
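A rough user-side sketch of that behavior, not the change made in pandas itself: `create_table_index_strict` is a hypothetical wrapper, and checking `table.colnames` is a simplification that only confirms each requested column exists as its own PyTables column.

```python
import pandas as pd


def create_table_index_strict(store: pd.HDFStore, key: str, columns: list, **kwargs) -> None:
    """Create an index, failing loudly if a column cannot be indexed."""
    table = store.get_storer(key).table
    # Columns not written as data_columns are packed into values blocks
    # and therefore do not appear under their own name.
    missing = [c for c in columns if c not in table.colnames]
    if missing:
        raise AttributeError(
            f"{missing} are not data_columns of {key!r}; "
            "re-append the data with the data_columns argument"
        )
    store.create_table_index(key, columns=columns, **kwargs)
```

Called on the second store from the report as `create_table_index_strict(st, 'df', ['B'], optlevel=9, kind='full')`, this raises instead of silently doing nothing.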

@jreback jreback added the Error Reporting Incorrect or improved errors from pandas label Jun 25, 2020
@jreback jreback added this to the 1.1 milestone Jun 25, 2020
6 participants
@jreback @jbrockmendel @mroeschke @adamjstewart @arw2019 and others