Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix categories in HDFStore not filtering correctly (#13322) #13792

Closed
wants to merge 1 commit into from

Conversation

shawnheide
Copy link
Contributor

@shawnheide shawnheide commented Jul 25, 2016

Fixed bug in pytables.py where series with categories were not being filtered properly.

@jreback
Copy link
Contributor

jreback commented Jul 25, 2016

always start with the tests

@jreback jreback added the IO HDF5 read_hdf, HDFStore label Jul 25, 2016
@shawnheide
Copy link
Contributor Author

Sorry Jeff, I'm not sure what you mean. Are you saying I should add a unit test for this?

@jreback
Copy link
Contributor

jreback commented Jul 25, 2016

ANY change requires a test.

@jreback
Copy link
Contributor

jreback commented Jul 25, 2016

at a minimum, tests that reproduce the original issue. You FIRST right the tests, make sure they replicate the failure, THEN write code to fix. You have fixed it when your tests then pass and nothing else breaks.

@shawnheide
Copy link
Contributor Author

Thanks for filling me in Jeff. Sorry for having to spell it out for me, but this is the first non-documentation PR I've submitted for pandas. I couldn't find any tests related to pytables but it seems like tests/frame/test_misc_api might be a good spot. Any insight would be appreciated. Thanks.

This is the code I wrote to test the fix before I submitted the pull:

obsids = ['ESP_012345_6789', 'ESP_987654_3210']
imgids = ['APF00006np', 'APF0001imm']
data = [4.3, 9.8]

df = pd.DataFrame(dict(obsids=obsids, imgids=imgids, data=data))
df.to_hdf('testdf_no_cats.hdf', 'df',format='t', data_columns=True)

df.obsids = df.obsids.astype('category')
df.imgids = df.imgids.astype('category')
df.to_hdf('testdf_with_cats.hdf', 'df',format='t', data_columns=True)

# df without categories
df2 = pd.read_hdf('testdf_no_cats.hdf', 'df', where='obsids=B')
assert(len(df2) == 0)

# df with categories
df3 = pd.read_hdf('testdf_with_cats.hdf', 'df', where='obsids=B')
assert(len(df3) == 0)

@jreback
Copy link
Contributor

jreback commented Jul 26, 2016

@shawnheide
Copy link
Contributor Author

Thanks Jeff, I completely missed that, I incorrectly assumed everything was in the one tests folder. Just submitted a new commit after confirming the test passed with my changes.

@codecov-io
Copy link

codecov-io commented Jul 26, 2016

Current coverage is 85.23% (diff: 100%)

Merging #13792 into master will increase coverage by <.01%

@@             master     #13792   diff @@
==========================================
  Files           140        140          
  Lines         50420      50422     +2   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42975      42977     +2   
  Misses         7445       7445          
  Partials          0          0          

Powered by Codecov. Last update cc216ad...a461208

@@ -237,6 +237,30 @@ def roundtrip(key, obj, **kwargs):
finally:
safe_remove(path)

def test_conv_categorical(self):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this right after test_categorical, and rename to test_categorical_conversion

@jreback
Copy link
Contributor

jreback commented Jul 27, 2016

pls add a whatsnew in bug fix section.

@jreback jreback added the Bug label Jul 27, 2016
@@ -197,7 +197,10 @@ def stringify(value):
return TermValue(int(v), v, kind)
elif meta == u('category'):
metadata = com._values_from_object(self.metadata)
result = metadata.searchsorted(v, side='left')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would do this like:

result = metadata.searchsorted(v, side='left')
if not result and v not in metadata:
    result = -1

then the 'in' is only done if the result is not found (as opposed to always)

@shawnheide
Copy link
Contributor Author

@jreback Thanks for all of the feedback. I really appreciate you showing me how to do things the right way. I made the changes you requested and rebased the commit.

@@ -786,3 +786,4 @@ Bug Fixes
- Bugs in ``Index.difference`` and ``DataFrame.join`` raise in Python3 when using mixed-integer indexes (:issue:`13432`, :issue:`12814`)

- Bug in ``.to_excel()`` when DataFrame contains a MultiIndex which contains a label with a NaN value (:issue:`13511`)
- Bug in ``pd.read_hdf()`` returns incorrect result when HDF Store contains a DataFrame with a categorical column (:issue:`13792`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

be more specific; this only doesn't work when the query has NO values that are found in the categorical.

@jreback
Copy link
Contributor

jreback commented Jul 29, 2016

thanks @shawnheide

@shawnheide shawnheide deleted the BUG_13322 branch August 10, 2016 22:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO HDF5 read_hdf, HDFStore
Projects
None yet
Development

Successfully merging this pull request may close these issues.

categories in HDFStore don't filter correctly
3 participants