BUG: return Series as DataFrame.dtypes/ftypes for empty dataframes #5740

immerrr · 2013-12-19T05:42:17Z

DataFrame.dtypes and DataFrame.ftypes values were inconsistent for empty dataframes:

In [2]: pd.DataFrame().dtypes
Out[2]: 
Empty DataFrame
Columns: []
Index: []

[0 rows x 0 columns]

In [3]: pd.DataFrame().ftypes
Out[3]: 
Empty DataFrame
Columns: []
Index: []

[0 rows x 0 columns]

In [4]: pd.DataFrame(columns=list("abc")).ftypes
Out[4]: 
a   NaN
b   NaN
c   NaN
dtype: float64

In [5]: pd.DataFrame(columns=list("abc")).dtypes
Out[5]: 
a   NaN
b   NaN
c   NaN
dtype: float64

In [6]: pd.__version__
Out[6]: '0.13.0rc1-92-gf6fd509'

immerrr · 2013-12-19T05:48:08Z

It is also faster.

OLD:

In [10]: timeit df.dtypes
100 loops, best of 3: 7.3 ms per loop

In [11]: timeit df.ftypes
100 loops, best of 3: 7.42 ms per loop

In [12]: pd.__version__
Out[12]: '0.13.0rc1-92-gf6fd509'

NEW:

In [1]: pd.__version__
Out[1]: '0.13.0rc1-93-g55d081e'

In [2]: df = pd.DataFrame(columns=range(100), index=[1])

In [3]: timeit df.dtypes
1000 loops, best of 3: 277 µs per loop

In [4]: timeit df.ftypes
1000 loops, best of 3: 259 µs per loop

jtratner · 2013-12-19T06:02:42Z

pandas/tests/test_frame.py

+        assert_series_equal(norows_int_df.ftypes, pd.Series('int32:dense', index=list("abc")))
+
+        odict = OrderedDict
+        df = pd.DataFrame(odict([('a', 1), ('b', True), ('c', 1.0)]), index=[1, 2, 3])


why is this test case here?

there's a case below that tests an empty slice of dataframe with columns having different dtypes, I couldn't resist adding two more asserts just to be sure that the result matches that of non-empty slice.

Revisiting this comment after an hour I think I might have misattributed the question to the beginning of df case, so just in case: tests concerning nocols_df, norows_df and norows_int_df are about empty dataframes (empty in a sense that they contain no actual data cells) and all returned garbage before patching dtype/ftype funcs.

immerrr · 2013-12-19T07:00:44Z

Oh, it breaks duplicate column test case (reindex is no good when columns are not unique)

immerrr · 2013-12-19T07:56:51Z

Fixed duplicate column test case.

Now dtypes/ftypes bodies feel like they belong to BlockManager class, not DataFrame, but I'm not sure if returning Series from lower-level code is OK.

Also, the code is looking for trouble if self._data.items can go out of sync with self.columns. I've found no such places but admittedly I'm not too familiar with the code.

jreback · 2013-12-19T12:17:38Z

@immerrr

this is essentially an apply_by_block reduction (which could be a separate method, maybe)

ok....you can put dtypes/ftypes in core/generic.py

they should both be a property, something like

@property
def dtypes(self):
    return self._constructor_sliced(self._data.get_dtypes())

this will break Series (as it cannot be reduced except to a scalar), so you may need to override Series.dtypes to just return the same as Series.dtype

A Panel should breturn a DataFrame

you can use the code that you have currently in core/frame.py/dtypes and put it in core/internals.py/get_dtypes ((in BlockManager)
pretty much as is (ou should just return a dict or dict of dict in the internal method)

get_dtype_counts/get_ftype_counts could be similarly extended to Panel/Series (though these should return Series for Series/DataFrame and a DataFrame for Panel)

jreback · 2013-12-19T12:19:44Z

FYI changing self.columns calls self._data.set_axis() which resets the items internally so by definition they are the same

jreback · 2013-12-19T12:24:54Z

see get_dtype_counts() in core/internals.py

for get_dtypes()/get_ftypes()

you need to call consolidate_inplace()

jtratner · 2013-12-19T12:53:10Z

pandas/tests/test_frame.py

+        assert_series_equal(norows_df.ftypes, pd.Series('object:dense', index=list("abc")))
+
+        norows_int_df = pd.DataFrame(columns=list("abc")).astype(np.int32)
+        assert_series_equal(norows_int_df.dtypes, pd.Series(np.dtype('int32'), index=list("abc")))


@jreback does this make sense? dtype's just going to change once you assign something anyways, no?

this is fine; yes dtype will change once you assign something

jreback · 2014-01-03T20:34:23Z

@immerrr can you update this as discussed above?

immerrr · 2014-01-04T21:21:02Z

@jreback I'm not sure I get you right.

Unlike {dtype: count} mapping, {column: dtype}'s keys are not necessarily unique since column names aren't restricted that way, hence returning a dict won't work. So, it's probably better to return [(column, dtype), ...] list (or just [dtype, ...] given that to construct Series of it you'll need to zip it anyways)
As far as I see in the code, Panel & Panel4D classes are subclasses of NDArray (just like DataFrame) which means they use the same BlockManager class under the hood which means that dtypes can only differ in one axis, why bother returning a dataframe/panel for higher-dimension containers? Are there plans to support more fine-grained dtype management?
If there are such plans, which axis is to be "reduced" when getting Panel dtypes, the minor one?
Why is it necessary to do consolidate_inplace? I mean it does seem beneficial to have all columns of same dtype inside single block, but I wouldn't expect reallocations in such a trivial property as df.dtypes.

jreback · 2014-01-04T22:14:57Z

Unlike {dtype: count} mapping, {column: dtype}'s keys are not necessarily unique since column names aren't restricted that way, hence returning a dict won't work. So, it's probably better to return [(column, dtype), ...] list (or just [dtype, ...] given that to construct Series of it you'll need to zip it anyways)

that's fine the internals method can just return that (or a Series directly for that matter)

As far as I see in the code, Panel & Panel4D classes are subclasses of NDArray (just like DataFrame) which means they use the same BlockManager class under the hood which means that dtypes can only differ in one axis, why bother returning a dataframe/panel for higher-dimension containers? Are there plans to support more fine-grained dtype management?

These are sub-class of NDFrame; dtypes are per-block; however the dimenisonaility is different. A DataFrame would return a Series of dtypes, while a Panel would return a DataFrame (depending on the axis, see #5850

I am not sure how you would support even more fine-grained dtyping under the current implementation, nor why it would be necessary

If there are such plans, which axis is to be "reduced" when getting Panel dtypes, the minor one?
see my above comment

Why is it necessary to do consolidate_inplace? I mean it does seem beneficial to have all columns of same dtype inside single block, but I wouldn't expect reallocations in such a trivial property as df.dtypes.
This operation should be done in pretty much evey internal operation; often times when interacting with blocks they are recreated, sometimes it is possible to create a new block that is the same shape, other times it is not (e.g. see convert for example)

This function needs to be done in core/internals.py as I indicated above. DataFrame cannot have that much knowledge of the internals as you propose. (and this method will nicely scale to other ndims as well).

jreback · 2014-01-16T22:14:44Z

fyi #5968 fixed the dtype slowness but would like to have these tests..so pls rebase when u have a chance

immerrr · 2014-01-17T11:49:47Z

Whoops, forgot about this PR... Will do the rebase.

These are sub-class of NDFrame; dtypes are per-block; however the dimenisonaility is different. A DataFrame would return a Series of dtypes, while a Panel would return a DataFrame (depending on the axis, see #5850

I am not sure how you would support even more fine-grained dtyping under the current implementation, nor why it would be necessary

The dimensionality is different, that's true. But wouldn't Panel.dtypes returning a DataFrame give a false impression that dtypes can vary along some axis other than "items"? I definitely would've fallen for that.

jreback · 2014-01-17T11:57:14Z

yep that is fixed up
u will get a series back for dtypes on higher dim objects

jreback · 2014-01-17T12:56:01Z

if you rebase this you wil see that dtypes are now fixed (and you can take out the changes in core/frame)

immerrr · 2014-01-18T07:58:01Z

if you rebase this you wil see that dtypes are now fixed (and you can take out the changes in core/frame)

Indeed. I've enforced the returned Series dtype for consistency though, otherwise than that rebased tests pass.

immerrr · 2014-02-16T18:29:31Z

Rebased against master. Can this be merged?

jreback · 2014-02-16T18:35:20Z

I think put a release note in as this is technically an API change (eg. that an empty Series/Frame would before be float64 I think), and now correctly object...can you update with that'

thxs

immerrr · 2014-02-17T07:04:59Z

Whoops, now that I'm re-reading the last message I doubt that you wanted me to put this to bugfixes. Do you want to move it to API changes?

jreback · 2014-02-17T07:34:21Z

API change

…ntainers

immerrr · 2014-02-17T07:56:50Z

Ok

BUG: return Series as DataFrame.dtypes/ftypes for empty dataframes

jreback · 2014-02-17T14:00:02Z

thanks!

jreback · 2014-02-17T15:49:55Z

for future reference.....you have to compare np.intt64 not np.int as this will fail on windows

c87a058

immerrr · 2014-02-18T04:38:41Z

Yup, those should be np.int_ and np.float_.

jtratner reviewed Dec 19, 2013
View reviewed changes

jreback mentioned this pull request Jan 16, 2014

df.dtypes.values is not O(1) and repr(df) is therefore slow for large frames #5968

Closed

jreback added API Design labels Feb 16, 2014

ENH: ensure dtypes/ftypes properties return dtype=object for empty co…

dfa8ec0

…ntainers

jreback added a commit that referenced this pull request Feb 17, 2014

Merge pull request #5740 from immerrr/fix-empty-frame-dtypes

f716c21

BUG: return Series as DataFrame.dtypes/ftypes for empty dataframes

jreback merged commit f716c21 into pandas-dev:master Feb 17, 2014

immerrr deleted the fix-empty-frame-dtypes branch February 18, 2014 04:38

immerrr restored the fix-empty-frame-dtypes branch March 3, 2014 17:14

immerrr deleted the fix-empty-frame-dtypes branch March 3, 2014 17:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: return Series as DataFrame.dtypes/ftypes for empty dataframes #5740

BUG: return Series as DataFrame.dtypes/ftypes for empty dataframes #5740

immerrr commented Dec 19, 2013

immerrr commented Dec 19, 2013

jtratner Dec 19, 2013

immerrr Dec 19, 2013

immerrr Dec 19, 2013

immerrr commented Dec 19, 2013

immerrr commented Dec 19, 2013

jreback commented Dec 19, 2013

jreback commented Dec 19, 2013

jreback commented Dec 19, 2013

jtratner Dec 19, 2013

jreback Dec 19, 2013

jreback commented Jan 3, 2014

immerrr commented Jan 4, 2014

jreback commented Jan 4, 2014

jreback commented Jan 16, 2014

immerrr commented Jan 17, 2014

jreback commented Jan 17, 2014

jreback commented Jan 17, 2014

immerrr commented Jan 18, 2014

immerrr commented Feb 16, 2014

jreback commented Feb 16, 2014

immerrr commented Feb 17, 2014

jreback commented Feb 17, 2014

immerrr commented Feb 17, 2014

jreback commented Feb 17, 2014

jreback commented Feb 17, 2014

immerrr commented Feb 18, 2014

BUG: return Series as DataFrame.dtypes/ftypes for empty dataframes #5740

BUG: return Series as DataFrame.dtypes/ftypes for empty dataframes #5740

Conversation

immerrr commented Dec 19, 2013

immerrr commented Dec 19, 2013

jtratner Dec 19, 2013

Choose a reason for hiding this comment

immerrr Dec 19, 2013

Choose a reason for hiding this comment

immerrr Dec 19, 2013

Choose a reason for hiding this comment

immerrr commented Dec 19, 2013

immerrr commented Dec 19, 2013

jreback commented Dec 19, 2013

jreback commented Dec 19, 2013

jreback commented Dec 19, 2013

jtratner Dec 19, 2013

Choose a reason for hiding this comment

jreback Dec 19, 2013

Choose a reason for hiding this comment

jreback commented Jan 3, 2014

immerrr commented Jan 4, 2014

jreback commented Jan 4, 2014

jreback commented Jan 16, 2014

immerrr commented Jan 17, 2014

jreback commented Jan 17, 2014

jreback commented Jan 17, 2014

immerrr commented Jan 18, 2014

immerrr commented Feb 16, 2014

jreback commented Feb 16, 2014

immerrr commented Feb 17, 2014

jreback commented Feb 17, 2014

immerrr commented Feb 17, 2014

jreback commented Feb 17, 2014

jreback commented Feb 17, 2014

immerrr commented Feb 18, 2014