Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: return Series as DataFrame.dtypes/ftypes for empty dataframes #5740

Merged
merged 1 commit into from
Feb 17, 2014

Conversation

immerrr
Copy link
Contributor

@immerrr immerrr commented Dec 19, 2013

DataFrame.dtypes and DataFrame.ftypes values were inconsistent for empty dataframes:

In [2]: pd.DataFrame().dtypes
Out[2]: 
Empty DataFrame
Columns: []
Index: []

[0 rows x 0 columns]

In [3]: pd.DataFrame().ftypes
Out[3]: 
Empty DataFrame
Columns: []
Index: []

[0 rows x 0 columns]

In [4]: pd.DataFrame(columns=list("abc")).ftypes
Out[4]: 
a   NaN
b   NaN
c   NaN
dtype: float64

In [5]: pd.DataFrame(columns=list("abc")).dtypes
Out[5]: 
a   NaN
b   NaN
c   NaN
dtype: float64

In [6]: pd.__version__
Out[6]: '0.13.0rc1-92-gf6fd509'

@immerrr
Copy link
Contributor Author

immerrr commented Dec 19, 2013

It is also faster.

OLD:

In [10]: timeit df.dtypes
100 loops, best of 3: 7.3 ms per loop

In [11]: timeit df.ftypes
100 loops, best of 3: 7.42 ms per loop

In [12]: pd.__version__
Out[12]: '0.13.0rc1-92-gf6fd509'

NEW:

In [1]: pd.__version__
Out[1]: '0.13.0rc1-93-g55d081e'

In [2]: df = pd.DataFrame(columns=range(100), index=[1])

In [3]: timeit df.dtypes
1000 loops, best of 3: 277 µs per loop

In [4]: timeit df.ftypes
1000 loops, best of 3: 259 µs per loop

assert_series_equal(norows_int_df.ftypes, pd.Series('int32:dense', index=list("abc")))

odict = OrderedDict
df = pd.DataFrame(odict([('a', 1), ('b', True), ('c', 1.0)]), index=[1, 2, 3])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this test case here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a case below that tests an empty slice of dataframe with columns having different dtypes, I couldn't resist adding two more asserts just to be sure that the result matches that of non-empty slice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revisiting this comment after an hour I think I might have misattributed the question to the beginning of df case, so just in case: tests concerning nocols_df, norows_df and norows_int_df are about empty dataframes (empty in a sense that they contain no actual data cells) and all returned garbage before patching dtype/ftype funcs.

@immerrr
Copy link
Contributor Author

immerrr commented Dec 19, 2013

Oh, it breaks duplicate column test case (reindex is no good when columns are not unique)

@immerrr
Copy link
Contributor Author

immerrr commented Dec 19, 2013

Fixed duplicate column test case.

Now dtypes/ftypes bodies feel like they belong to BlockManager class, not DataFrame, but I'm not sure if returning Series from lower-level code is OK.

Also, the code is looking for trouble if self._data.items can go out of sync with self.columns. I've found no such places but admittedly I'm not too familiar with the code.

@jreback
Copy link
Contributor

jreback commented Dec 19, 2013

@immerrr

this is essentially an apply_by_block reduction (which could be a separate method, maybe)

ok....you can put dtypes/ftypes in core/generic.py

they should both be a property, something like

@property
def dtypes(self):
    return self._constructor_sliced(self._data.get_dtypes())

this will break Series (as it cannot be reduced except to a scalar), so you may need to override Series.dtypes to just return the same as Series.dtype

A Panel should breturn a DataFrame

you can use the code that you have currently in core/frame.py/dtypes and put it in core/internals.py/get_dtypes ((in BlockManager)
pretty much as is (ou should just return a dict or dict of dict in the internal method)

get_dtype_counts/get_ftype_counts could be similarly extended to Panel/Series (though these should return Series for Series/DataFrame and a DataFrame for Panel)

@jreback
Copy link
Contributor

jreback commented Dec 19, 2013

FYI changing self.columns calls self._data.set_axis() which resets the items internally so by definition they are the same

@jreback
Copy link
Contributor

jreback commented Dec 19, 2013

see get_dtype_counts() in core/internals.py

for get_dtypes()/get_ftypes()

you need to call consolidate_inplace()

assert_series_equal(norows_df.ftypes, pd.Series('object:dense', index=list("abc")))

norows_int_df = pd.DataFrame(columns=list("abc")).astype(np.int32)
assert_series_equal(norows_int_df.dtypes, pd.Series(np.dtype('int32'), index=list("abc")))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback does this make sense? dtype's just going to change once you assign something anyways, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is fine; yes dtype will change once you assign something

@jreback
Copy link
Contributor

jreback commented Jan 3, 2014

@immerrr can you update this as discussed above?

@immerrr
Copy link
Contributor Author

immerrr commented Jan 4, 2014

@jreback I'm not sure I get you right.

  • Unlike {dtype: count} mapping, {column: dtype}'s keys are not necessarily unique since column names aren't restricted that way, hence returning a dict won't work. So, it's probably better to return [(column, dtype), ...] list (or just [dtype, ...] given that to construct Series of it you'll need to zip it anyways)
  • As far as I see in the code, Panel & Panel4D classes are subclasses of NDArray (just like DataFrame) which means they use the same BlockManager class under the hood which means that dtypes can only differ in one axis, why bother returning a dataframe/panel for higher-dimension containers? Are there plans to support more fine-grained dtype management?
  • If there are such plans, which axis is to be "reduced" when getting Panel dtypes, the minor one?
  • Why is it necessary to do consolidate_inplace? I mean it does seem beneficial to have all columns of same dtype inside single block, but I wouldn't expect reallocations in such a trivial property as df.dtypes.

@jreback
Copy link
Contributor

jreback commented Jan 4, 2014

Unlike {dtype: count} mapping, {column: dtype}'s keys are not necessarily unique since column names aren't restricted that way, hence returning a dict won't work. So, it's probably better to return [(column, dtype), ...] list (or just [dtype, ...] given that to construct Series of it you'll need to zip it anyways)

that's fine the internals method can just return that (or a Series directly for that matter)

As far as I see in the code, Panel & Panel4D classes are subclasses of NDArray (just like DataFrame) which means they use the same BlockManager class under the hood which means that dtypes can only differ in one axis, why bother returning a dataframe/panel for higher-dimension containers? Are there plans to support more fine-grained dtype management?

These are sub-class of NDFrame; dtypes are per-block; however the dimenisonaility is different. A DataFrame would return a Series of dtypes, while a Panel would return a DataFrame (depending on the axis, see #5850

I am not sure how you would support even more fine-grained dtyping under the current implementation, nor why it would be necessary

If there are such plans, which axis is to be "reduced" when getting Panel dtypes, the minor one?
see my above comment

Why is it necessary to do consolidate_inplace? I mean it does seem beneficial to have all columns of same dtype inside single block, but I wouldn't expect reallocations in such a trivial property as df.dtypes.
This operation should be done in pretty much evey internal operation; often times when interacting with blocks they are recreated, sometimes it is possible to create a new block that is the same shape, other times it is not (e.g. see convert for example)

This function needs to be done in core/internals.py as I indicated above. DataFrame cannot have that much knowledge of the internals as you propose. (and this method will nicely scale to other ndims as well).

@jreback
Copy link
Contributor

jreback commented Jan 16, 2014

fyi #5968 fixed the dtype slowness but would like to have these tests..so pls rebase when u have a chance

@immerrr
Copy link
Contributor Author

immerrr commented Jan 17, 2014

Whoops, forgot about this PR... Will do the rebase.

These are sub-class of NDFrame; dtypes are per-block; however the dimenisonaility is different. A DataFrame would return a Series of dtypes, while a Panel would return a DataFrame (depending on the axis, see #5850

I am not sure how you would support even more fine-grained dtyping under the current implementation, nor why it would be necessary

The dimensionality is different, that's true. But wouldn't Panel.dtypes returning a DataFrame give a false impression that dtypes can vary along some axis other than "items"? I definitely would've fallen for that.

@jreback
Copy link
Contributor

jreback commented Jan 17, 2014

yep that is fixed up
u will get a series back for dtypes on higher dim objects

@jreback
Copy link
Contributor

jreback commented Jan 17, 2014

if you rebase this you wil see that dtypes are now fixed (and you can take out the changes in core/frame)

@immerrr
Copy link
Contributor Author

immerrr commented Jan 18, 2014

if you rebase this you wil see that dtypes are now fixed (and you can take out the changes in core/frame)

Indeed. I've enforced the returned Series dtype for consistency though, otherwise than that rebased tests pass.

@immerrr
Copy link
Contributor Author

immerrr commented Feb 16, 2014

Rebased against master. Can this be merged?

@jreback
Copy link
Contributor

jreback commented Feb 16, 2014

I think put a release note in as this is technically an API change (eg. that an empty Series/Frame would before be float64 I think), and now correctly object...can you update with that'

thxs

@immerrr
Copy link
Contributor Author

immerrr commented Feb 17, 2014

Whoops, now that I'm re-reading the last message I doubt that you wanted me to put this to bugfixes. Do you want to move it to API changes?

@jreback
Copy link
Contributor

jreback commented Feb 17, 2014

API change

@immerrr
Copy link
Contributor Author

immerrr commented Feb 17, 2014

Ok

jreback added a commit that referenced this pull request Feb 17, 2014
BUG: return Series as DataFrame.dtypes/ftypes for empty dataframes
@jreback jreback merged commit f716c21 into pandas-dev:master Feb 17, 2014
@jreback
Copy link
Contributor

jreback commented Feb 17, 2014

thanks!

@jreback
Copy link
Contributor

jreback commented Feb 17, 2014

for future reference.....you have to compare np.intt64 not np.int as this will fail on windows

c87a058

@immerrr
Copy link
Contributor Author

immerrr commented Feb 18, 2014

Yup, those should be np.int_ and np.float_.

@immerrr immerrr deleted the fix-empty-frame-dtypes branch February 18, 2014 04:38
@immerrr immerrr restored the fix-empty-frame-dtypes branch March 3, 2014 17:14
@immerrr immerrr deleted the fix-empty-frame-dtypes branch March 3, 2014 17:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants