Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: allow categoricals in msgpack #12573

Closed
wants to merge 1 commit into from

Conversation

pwaller
Copy link
Contributor

@pwaller pwaller commented Mar 9, 2016

Supercedes #12191.

I've made a best-effort to rebase this against master.

It seems to work here.

@jreback jreback added this to the 0.18.1 milestone Mar 9, 2016
@jreback
Copy link
Contributor

jreback commented Mar 9, 2016

  • add a note in whatsnew/0.18.1

ctor_dtype = np_dtype
if is_categorical_dtype(pd_dtype):
# Series ctor doesn't take dtype with categorical
ctor_dtype = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback - should the Series constructor accept a Categorical with dtype=category? This part felt hacky.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it already does.
However in this case want to construct a pd.Categorical as you can pass all of the attributes thru

In [1]: Series(list('aabc'),dtype='category')
Out[1]: 
0    a
1    a
2    b
3    c
dtype: category
Categories (3, object): [a, b, c]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but you don't need this here at all. the dtype is not present so its None, the Series will just take a Categorical directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I conclude that no change is required here in scope of this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, this needs to be reworked a bit, seem my 2nd comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, that might be a bug as in this case it is ok to specify the dtype (but it should be ignored as this is pretty generic as it doesn't specify anything in particular about the dtype, so its compatible). can you open a separate issue for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to start a new thread on this line of code.

So, if I understood, it shouldn't be necessary to special case the dtype? If I remove the dtype=ctor_dtype change (and leave it set to dtype=np_dtype), then it does not work. I get:

ValueError: cannot specify a dtype with a Categorical unless dtype='category'

I don't currently understand what is being asked for here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to rebase on master, that's the issue that was just merged

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was after rebasing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so that's what I was first going to write. This is not a numpy-dtype and cannot be treated like one. Instead put this code in encode/decode (e.g. check isinstance(obj, Categorical)). Then you won't have this issue at all.

@pwaller
Copy link
Contributor Author

pwaller commented Mar 9, 2016

add a note in whatsnew/0.18.1

Done. Please let me know if it is to taste.

@@ -23,6 +23,9 @@ Enhancements
~~~~~~~~~~~~


- ``read_msgpack`` now supports categoricals (:issue:`12573`)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

support for serializing and de-serializing categoricalls with msgpack

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -341,6 +342,7 @@ def setUp(self):
self.d['mixed'] = Series(data['E'])
self.d['dt_tz_mixed'] = Series(data['F'])
self.d['dt_tz'] = Series(data['G'])
self.d['categorical'] = Series(data['H'])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chris-b1 is this correct? Why is it Series and not Categorical?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wouldn't make a difference here, e.g. df['c'] = Categorical(..) and df['c'] = Series(Categorical(..)) do the same thing

@jreback
Copy link
Contributor

jreback commented Mar 16, 2016

ok #12574 has been merged. pls rebase/update.

@pwaller
Copy link
Contributor Author

pwaller commented Mar 16, 2016

Rebased.

@@ -30,7 +30,7 @@ Enhancements
~~~~~~~~~~~~



- support for serializing and de-serializing categoricalls with msgpack (:issue:`12573`)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Support.........categoricals (sp)

@pwaller pwaller force-pushed the msgpack-categorical branch 2 times, most recently from d5d077b to ecab0d4 Compare March 16, 2016 16:34
@pwaller
Copy link
Contributor Author

pwaller commented Mar 16, 2016

Wow, I think I managed to clear all of your comments. Everything suddenly made sense in the end. Please let me know if I missed something and if I should squash or reword my commits.

@pwaller
Copy link
Contributor Author

pwaller commented Mar 16, 2016

I broke the NDFrame tests. Investigating.

return {u'typ': u'categorical',
u'klass': u(obj.__class__.__name__),
u'name': getattr(obj, 'name', None),
u'codes_dtype': u(obj._codes.dtype.name),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you actually need to specify the codes/catgories dtypes. Those will be automatic. IOW, don't use convert here, just same codes : obj.codes and categories : obj.categories. These are recursive encoder/decoders.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

return {u'typ': u'categorical',
u'klass': u(obj.__class__.__name__),
u'name': getattr(obj, 'name', None),
u'codes': obj._codes,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob use obj.codes here (same but _codes is internal)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

'G': [Timestamp('20130603', tz='CET')] * 5
'G': [Timestamp('20130603', tz='CET')] * 5,
'H': Categorical(['a', 'b', 'c', 'd', 'e']),
'I': Categorical(['a', 'b', 'c', 'd', 'e'], ordered=True),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in a strange position here. I added this test - it passes, but I didn't add the relevant code in convert/unconvert to pass the ordered parameter through. What gives?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that check is not used at all! I was just going to tell you to take it out. a Categorical is fully serialized/deserialized via encode/decode the dtype is NEVER category. except when its a series but that is already handled.

@jreback
Copy link
Contributor

jreback commented Apr 12, 2016

@pwaller can you rebase/update

@jreback
Copy link
Contributor

jreback commented Apr 17, 2016

this looks pretty good code wise. can you squash everything.

@jreback
Copy link
Contributor

jreback commented Apr 17, 2016

cc @kawochen

@pwaller pwaller force-pushed the msgpack-categorical branch 3 times, most recently from fc7e1f1 to 52ea6e4 Compare April 17, 2016 16:23
@pwaller
Copy link
Contributor Author

pwaller commented Apr 17, 2016

Squashed.

@@ -262,6 +265,9 @@ def convert(values):
v = values.ravel()

# convert object
if is_categorical_dtype(values):
return v

if dtype == np.object_:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change line 271 to:

elif is_object_dtype(dtype)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@pwaller
Copy link
Contributor Author

pwaller commented Apr 24, 2016

Rebased and comments addressed.

@pwaller
Copy link
Contributor Author

pwaller commented Apr 24, 2016

I'm getting a test failure I'm not sure how to address and I don't have much time to look into it:

======================================================================
ERROR: test_basic_frame (pandas.io.tests.test_packers.TestNDFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/pydata/pandas/pandas/io/tests/test_packers.py", line 410, in test_basic_frame
    i_rec = self.encode_decode(i)
  File "/home/travis/build/pydata/pandas/pandas/io/tests/test_packers.py", line 75, in encode_decode
    return read_msgpack(p, **kwargs)
  File "/home/travis/build/pydata/pandas/pandas/io/packers.py", line 203, in read_msgpack
    return read(fh)
  File "/home/travis/build/pydata/pandas/pandas/io/packers.py", line 188, in read
    l = list(unpack(fh, encoding=encoding, **kwargs))
  File "_unpacker.pyx", line 459, in pandas.msgpack._unpacker.Unpacker.__next__ (pandas/msgpack/_unpacker.cpp:4709)
  File "_unpacker.pyx", line 390, in pandas.msgpack._unpacker.Unpacker._unpack (pandas/msgpack/_unpacker.cpp:3843)
  File "/home/travis/build/pydata/pandas/pandas/io/packers.py", line 639, in decode
    blocks = [create_block(b) for b in obj[u'blocks']]
  File "/home/travis/build/pydata/pandas/pandas/io/packers.py", line 637, in create_block
    dtype=b[u'dtype'])
  File "/home/travis/build/pydata/pandas/pandas/core/internals.py", line 2518, in make_block
    return klass(values, ndim=ndim, fastpath=fastpath, placement=placement)
  File "/home/travis/build/pydata/pandas/pandas/core/internals.py", line 1903, in __init__
    placement=placement, **kwargs)
  File "/home/travis/build/pydata/pandas/pandas/core/internals.py", line 1314, in __init__
    raise TypeError("values must be {0}".format(self._holder.__name__))
TypeError: values must be Categorical

@@ -393,6 +402,16 @@ def encode(obj):
u'dtype': u(obj.dtype.name),
u'data': convert(obj.values),
u'compress': compressor}

elif isinstance(obj, Categorical):
return {u'typ': u'categorical',
Copy link
Contributor

@jreback jreback Apr 25, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be 'category'

@pwaller
Copy link
Contributor Author

pwaller commented Apr 25, 2016

Updated again. Thanks for your patience.

@jreback
Copy link
Contributor

jreback commented Apr 25, 2016

To fix that error

diff --git a/pandas/io/packers.py b/pandas/io/packers.py
index ba38c8e..f009793 100644
--- a/pandas/io/packers.py
+++ b/pandas/io/packers.py
@@ -266,7 +266,7 @@ def convert(values):

     # convert object
     if is_categorical_dtype(values):
-        return v
+        return values

     if dtype == np.object_:
         return v.tolist()

DOC: support for categoricals in read_msgpack

Add TestCategorical test cases

Add Catecorical ordered=True ndframe test
@pwaller
Copy link
Contributor Author

pwaller commented Apr 25, 2016

Forced again.

@jreback
Copy link
Contributor

jreback commented Apr 25, 2016

ok looks good. ping when green.

@jreback jreback closed this in 2fd0a06 Apr 25, 2016
@jreback
Copy link
Contributor

jreback commented Apr 25, 2016

thanks!

@jankatins jankatins mentioned this pull request Apr 25, 2016
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants