New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: allow get_dummies to accept dtype argument #18330

Merged
merged 24 commits into from Nov 22, 2017

Conversation

Projects
None yet
4 participants
@Scorpil
Contributor

Scorpil commented Nov 16, 2017

  • closes #18330 (there's no issue for this one)
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Update in version 0.19.0 made get_dummies return uint8 values instead of floats (#8725). While I agree with the argument that get_dummies should output integers by default (to save some memory), in many cases it would be beneficial for user to choose other dtype.

In my case there was serious performance degradation between versions 0.18 and 0.19. After investigation, reason behind it turned out to be the change to get_dummies output type. DataFrame with dummy values was used as an argument to np.dot in an optimization function (second argument was matrix of floats). Since there were lots of iterations involved, and on each iteration np.dot was converting all uint8 values to float64, conversion overhead took unreasonably long time. It is possible to work around this issue by converting dummy columns "manually" afterwards, but it adds unnecessary complexity to the code and is clearly less convenient than calling get_dummies with dtype=float.

Apart from performance considerations, I can imagine dtype=bool to be a common use case.

get_dummies(data, dtype=None) is allowed and will return uint8 values to match the DataFrame interface (where None allows inferring datatype, which is default behavior).

I've extended the test suite to run all the get_dummies tests (except for those that don't deal with internal dtypes, like test_just_na) twice, once with uint8 and once with float64.

@codecov

This comment has been minimized.

codecov bot commented Nov 16, 2017

Codecov Report

Merging #18330 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18330      +/-   ##
==========================================
- Coverage   91.35%   91.33%   -0.02%     
==========================================
  Files         163      163              
  Lines       49714    49719       +5     
==========================================
- Hits        45415    45410       -5     
- Misses       4299     4309      +10
Flag Coverage Δ
#multiple 89.13% <100%> (-0.01%) ⬇️
#single 39.63% <0%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 95.73% <ø> (ø) ⬆️
pandas/core/reshape/reshape.py 100% <100%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.1%) ⬇️
pandas/core/indexes/datetimes.py 95.39% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d421a09...158a317. Read the comment docs.

@@ -725,6 +725,8 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
drop_first : bool, default False
Whether to get k-1 dummies out of k categorical levels by removing the
first level.
dtype : dtype, default np.uint8

This comment has been minimized.

@jreback

jreback Nov 17, 2017

Contributor

add a versionadded tag

@@ -217,34 +217,36 @@ def test_multiindex(self):
class TestGetDummies(object):

This comment has been minimized.

@jreback

jreback Nov 17, 2017

Contributor

instead of doing this define a fixture that returns the various dtypes that you are testing

@@ -140,7 +140,7 @@ Sparse
Reshaping
^^^^^^^^^
-
- :func:`get_dummies` now supports ``dtype`` argument

This comment has been minimized.

@jreback

jreback Nov 17, 2017

Contributor

add a little more expl, add the PR number as the issue number. Move to Other Enhancements section.

@Scorpil

This comment has been minimized.

Contributor

Scorpil commented Nov 18, 2017

All done. Also updated sparse tests to use fixtures as well. And added one test to verify effective dtype is uint8 when dtype argument is None.

@jreback

looks good generally. thanks for parametrizing the tests!

@@ -24,7 +24,17 @@ Other Enhancements
- Better support for :func:`Dataframe.style.to_excel` output with the ``xlsxwriter`` engine. (:issue:`16149`)
- :func:`pandas.tseries.frequencies.to_offset` now accepts leading '+' signs e.g. '+1h'. (:issue:`18171`)
-

This comment has been minimized.

@jreback

jreback Nov 19, 2017

Contributor

make a separate sub-section for this

@@ -24,7 +24,17 @@ Other Enhancements
- Better support for :func:`Dataframe.style.to_excel` output with the ``xlsxwriter`` engine. (:issue:`16149`)
- :func:`pandas.tseries.frequencies.to_offset` now accepts leading '+' signs e.g. '+1h'. (:issue:`18171`)
-
- :func:`pandas.get_dummies` now supports ``dtype`` argument, which forces specific dtype for new columns. (:issue:`18330`)

This comment has been minimized.

@jreback

jreback Nov 19, 2017

Contributor

say default is the same (uint8)

-
- :func:`pandas.get_dummies` now supports ``dtype`` argument, which forces specific dtype for new columns. (:issue:`18330`)
.. code-block:: ipython

This comment has been minimized.

@jreback

jreback Nov 19, 2017

Contributor

use an ipython block, show the original as well (first)

self.df = DataFrame({'A': ['a', 'b', 'a'],
'B': ['b', 'b', 'c'],
'C': [1, 2, 3]})
@pytest.fixture(params=['uint8', 'float64'])

This comment has been minimized.

@jreback

jreback Nov 19, 2017

Contributor

cycle thru more dtypes here that are valid (doesn't have to be all, but include int64, bool, IOW valid for both sparse/dense)

This comment has been minimized.

@jreback

jreback Nov 19, 2017

Contributor

add None as well

This comment has been minimized.

@jreback

jreback Nov 19, 2017

Contributor

we should prob raise on object dtype I think.

expected = DataFrame({'a': [1, 0, 0],
'b': [0, 1, 0],
'c': [0, 0, 1]}, dtype=dtype)
result = get_dummies(s_list, **kwargs)

This comment has been minimized.

@jreback

jreback Nov 19, 2017

Contributor

I wouldn't construct the kwargs, actually just pass directly

assert_frame_equal(res, exp)
# Sparse dataframes do not allow nan labelled columns, see #GH8822
res_na = get_dummies(s, dummy_na=True, sparse=self.sparse)
res_na = get_dummies(s, dummy_na=True, **kwargs)

This comment has been minimized.

@jreback

jreback Nov 19, 2017

Contributor

see my comment above, don't use kwargs generally for passing args to test functions, rather pass directly

@jreback

This comment has been minimized.

Contributor

jreback commented Nov 19, 2017

@pep8speaks

This comment has been minimized.

pep8speaks commented Nov 19, 2017

Hello @Scorpil! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on November 22, 2017 at 13:15 Hours UTC
@@ -24,7 +24,27 @@ Other Enhancements
- Better support for :func:`Dataframe.style.to_excel` output with the ``xlsxwriter`` engine. (:issue:`16149`)
- :func:`pandas.tseries.frequencies.to_offset` now accepts leading '+' signs e.g. '+1h'. (:issue:`18171`)
-
``get_dummies`` now supports ``dtype`` argument

This comment has been minimized.

@jreback

jreback Nov 19, 2017

Contributor

add a ref here as well.

``get_dummies`` now supports ``dtype`` argument
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:func:`get_dummies` function now accepts ``dtype`` argument, which forces specific dtype for new columns. When ``dtype`` is not specified or equals to ``None``, new columns will have dtype ``uint8`` (as before), so this change is backwards compatible. (:issue:`18330`)

This comment has been minimized.

@jreback

jreback Nov 19, 2017

Contributor

The :func:`get_dummies` function now accepts adtypeargument, which forces a specific dtype for the new columns. The default isuint8ifdtypeis not specified orNone``.

if dtype is None:
dtype = np.uint8
if np.dtype(dtype) is np.dtype('O'):

This comment has been minimized.

@jreback

jreback Nov 19, 2017

Contributor

use is_object_dtype

if np.dtype(dtype) is np.dtype('O'):
raise TypeError("'object' is not a valid type for get_dummies")

This comment has been minimized.

@jreback

jreback Nov 19, 2017

Contributor

this could be a ValueError; also so dtype=object is not a valid dtype for get_dummies

return result
def _get_dummies_1d(data, prefix, prefix_sep='_', dummy_na=False,
sparse=False, drop_first=False):
sparse=False, drop_first=False, dtype=np.uint8):

This comment has been minimized.

@jreback

jreback Nov 19, 2017

Contributor

should this be passthru?

This comment has been minimized.

@jreback

jreback Nov 20, 2017

Contributor

?

This comment has been minimized.

@Scorpil

Scorpil Nov 20, 2017

Contributor

Right, missed this one. Idea was to treat this one as internal and allow "wrapper" to set dtype, but passthru has it's advantages and I don't mind ether way, so I'll move dtype-related conversions to this method.

@TomAugspurger

Thanks.

I'll finish taking a look later but my only real concern is how many times you parametrize by the dtype. It's great to do that for some tests like test_basic_dtype and a few others, but I'm not sure about all the prefix / sep tests. Do you have specific concerns that you're trying to test there?

``get_dummies`` now supports ``dtype`` argument
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:func:`get_dummies` function now accepts ``dtype`` argument, which forces specific dtype for new columns. When ``dtype`` is not specified or equals to ``None``, new columns will have dtype ``uint8`` (as before), so this change is backwards compatible. (:issue:`18330`)

This comment has been minimized.

@TomAugspurger

TomAugspurger Nov 19, 2017

Contributor
  • Can remove function after get_dummies.
  • now accepts a dtype argument
  • replace forces with specifies a

Replace the second sentence with

When ``dtype`` is not specified, the dtype will be ``uint8`` as before.
.. ipython:: python
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
pd.get_dummies(df, columns=['c'])

This comment has been minimized.

@TomAugspurger

TomAugspurger Nov 19, 2017

Contributor

I think you can remove the "Previous behavior" section since this is backwards compatible.

I'd just do

pd.get_dummies(df, columns=['c']).dtypes
pd.get_dummies(df, columns=['c'], dtype=bool).dtypes
@@ -697,7 +697,7 @@ def _convert_level_number(level_num, columns):
def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
columns=None, sparse=False, drop_first=False):
columns=None, sparse=False, drop_first=False, dtype=None):

This comment has been minimized.

@TomAugspurger

TomAugspurger Nov 19, 2017

Contributor

Any reason not to use 'uint8' or np.uint8 here?

This comment has been minimized.

@Scorpil

Scorpil Nov 19, 2017

Contributor

I've tried to mirror API of DataFrame, Series, Panel etc. where passing None explicitly is allowed and means "dtype will be inferred".

This comment has been minimized.

@Scorpil

Scorpil Nov 20, 2017

Contributor

@jreback @TomAugspurger So this is the last question to answer. Do you accept my argument about None or should I change it to np.uint8?

This comment has been minimized.

@jreback

jreback Nov 20, 2017

Contributor

this is ok here, it follows a similar style elsewhere

@@ -728,6 +728,11 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
.. versionadded:: 0.18.0
dtype : dtype, default np.uint8

This comment has been minimized.

@TomAugspurger

TomAugspurger Nov 19, 2017

Contributor

Can we also accept arguments to np.dtype like the string 'i8', and handle those appropriately?

This comment has been minimized.

@jreback

jreback Nov 19, 2017

Contributor

he is using np.dtype()?

This comment has been minimized.

@Scorpil

Scorpil Nov 19, 2017

Contributor

Yes, that should work already, I'll add it to the tests.

# e.g. TestGetDummies::test_basic[uint8-sparse] instead of [uint8-True]
return request.param == 'sparse'
def effective_dtype(self, dtype):

This comment has been minimized.

@TomAugspurger

TomAugspurger Nov 19, 2017

Contributor

If we make the default np.uint8 you can remove this.

'C': [1, 2, 3]})
@pytest.fixture(params=['uint8', 'int64', np.float64, bool, None])
def dtype(self, request):

This comment has been minimized.

@TomAugspurger

TomAugspurger Nov 19, 2017

Contributor

If we use this fixture in many places it's going to add a ton of tests.

This comment has been minimized.

@jreback

jreback Nov 19, 2017

Contributor

these are all quick, so this is ok

This comment has been minimized.

@Scorpil

Scorpil Nov 19, 2017

Contributor

Initially I only had 'uint8' and 'float64', but @jreback reasonably suggested to add some more. What would be a good balance here? If I'll remove usage of this fixture from all the unrelated tests like prefix / separator tests, and move None to separate stand-alone test, would ['uint8', 'i8', np.float64, bool] be OK? Still x4 number of tests, but each item uses a different way to specify dtype, so i think it's meaningful set of fixtures.

This comment has been minimized.

@jreback

jreback Nov 19, 2017

Contributor

i think toms point is to not apply this fixture to every test (just relevant ones)

def test_dataframe_dummies_all_obj(self):
df = self.df[['A', 'B']]
result = get_dummies(df, sparse=self.sparse)
def test_dataframe_dummies_all_obj(self, df, sparse, dtype):

This comment has been minimized.

@TomAugspurger

TomAugspurger Nov 19, 2017

Contributor

What's the benefit of parametrizing by dtype here?

assert_frame_equal(result, expected)
def test_dataframe_dummies_prefix_str(self):
def test_dataframe_dummies_prefix_str(self, df, sparse, dtype):

This comment has been minimized.

@TomAugspurger

TomAugspurger Nov 19, 2017

Contributor

Same question. Is there any relationship between the prefix and dtype? The seem orthogonal.

assert_frame_equal(result, expected)
def test_dataframe_dummies_subset(self):
df = self.df
def test_dataframe_dummies_subset(self, df, sparse, dtype):

This comment has been minimized.

@TomAugspurger

TomAugspurger Nov 19, 2017

Contributor

Same question: Any interaction between subset and dtype?

def test_dataframe_dummies_prefix_sep(self):
df = self.df
result = get_dummies(df, prefix_sep='..', sparse=self.sparse)
def test_dataframe_dummies_prefix_sep(self, df, sparse, dtype):

This comment has been minimized.

@TomAugspurger

TomAugspurger Nov 19, 2017

Contributor

Same question :)

@@ -27,6 +27,17 @@ Other Enhancements
- :class:`pandas.io.formats.style.Styler` now has method ``hide_index()`` to determine whether the index will be rendered in ouptut (:issue:`14194`)
- :class:`pandas.io.formats.style.Styler` now has method ``hide_columns()`` to determine whether columns will be hidden in output (:issue:`14194`)

This comment has been minimized.

@jreback

jreback Nov 20, 2017

Contributor

can you add a ref here

This comment has been minimized.

@Scorpil

Scorpil Nov 20, 2017

Contributor

I'm sorry, not used to work with sphinx. Do you mean something like this here:

- :func:`get_dummies` now supports ``dtype`` argument, see :ref:`here <whatsnew_0220.enhancements.get_dummies_dtype>` for more  (:issue: `18330`)

and then this before the actual description block:

.. _whatsnew_0220.enhancements.get_dummies_dtype

?

@@ -697,7 +697,7 @@ def _convert_level_number(level_num, columns):
def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
columns=None, sparse=False, drop_first=False):
columns=None, sparse=False, drop_first=False, dtype=None):

This comment has been minimized.

@jreback

jreback Nov 20, 2017

Contributor

this is ok here, it follows a similar style elsewhere

See Also
--------
Series.str.get_dummies
"""
from pandas.core.reshape.concat import concat
from itertools import cycle
if dtype is None:
dtype = np.uint8

This comment has been minimized.

@jreback

jreback Nov 20, 2017

Contributor

need a dtype = np.dtype(dtype)

return result
def _get_dummies_1d(data, prefix, prefix_sep='_', dummy_na=False,
sparse=False, drop_first=False):
sparse=False, drop_first=False, dtype=np.uint8):

This comment has been minimized.

@jreback

jreback Nov 20, 2017

Contributor

?

@TomAugspurger

Gave another quick glance, and things look good here.

# not that you should do this...
df = self.df
result = get_dummies(df, prefix='bad', sparse=self.sparse)
df[['C']] = df[['C']].astype(np.uint8)

This comment has been minimized.

@TomAugspurger

TomAugspurger Nov 20, 2017

Contributor

Why the change here?

This comment has been minimized.

@Scorpil

Scorpil Nov 20, 2017

Contributor

This test is a bit weird... Having 2 columns with identical names caused ValueError when I tried expected['C'] = expected['C']. Now, when you mentioned it, I see in diff that expected = expected.astype({"C": np.int64}) should work, I'll put it back.

df.loc[3, :] = [np.nan, np.nan, np.nan]
result = get_dummies(df, dummy_na=True, sparse=self.sparse)
result = get_dummies(df, dummy_na=True,

This comment has been minimized.

@TomAugspurger

TomAugspurger Nov 20, 2017

Contributor

Did you add the sorting because the output changed, or to make the test easier to write?

I slightly prefer the explicit ordering rather than sorting, though that'll be covered elsewhere so changing it isn't a huge deal.

This comment has been minimized.

@Scorpil

Scorpil Nov 20, 2017

Contributor

Just a bit easier to write.

@jreback

lgtm. small doc changes. have a look in reshaping.rst if any doc updates are needed.

@@ -28,6 +28,20 @@ Other Enhancements
- :class:`pandas.io.formats.style.Styler` now has method ``hide_index()`` to determine whether the index will be rendered in ouptut (:issue:`14194`)
- :class:`pandas.io.formats.style.Styler` now has method ``hide_columns()`` to determine whether columns will be hidden in output (:issue:`14194`)
- Improved wording of ``ValueError`` raised in :func:`to_datetime` when ``unit=`` is passed with a non-convertible value (:issue:`14350`)
- :func:`get_dummies` now supports ``dtype`` argument, see :ref:`here <whatsnew_0220.enhancements.get_dummies_dtype>` for more (:issue:`18330`)

This comment has been minimized.

@jreback

jreback Nov 22, 2017

Contributor

you can remove this line, its already covered in the sub-section. move the sub-section before other enhancements

return np.uint8
return dtype
def test_throws_on_dtype_object(self, df):

This comment has been minimized.

@jreback

jreback Nov 22, 2017

Contributor

throws -> raises

pd.get_dummies(df, dtype=bool).dtypes
.. versionadded:: 0.22.0

This comment has been minimized.

@jreback

jreback Nov 22, 2017

Contributor

ca you move to before the example

``get_dummies`` now supports ``dtype`` argument
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The :func:`get_dummies` now accepts a ``dtype`` argument, which specifies a specific dtype for the new columns. When ``dtype`` is not specified or ``None``, the dtype will be ``uint8`` as before. (:issue:`18330`)

This comment has been minimized.

@jreback

jreback Nov 22, 2017

Contributor

just say the default remains uint8

This comment has been minimized.

@Scorpil

Scorpil Nov 22, 2017

Contributor

Done. Also removed useless 'specific' in "specifies a specific dtype".

@@ -240,7 +240,7 @@ values will be set to ``NaN``.
df3
df3.unstack()
.. versionadded: 0.18.0
.. versionadded:: 0.18.0

This comment has been minimized.

@Scorpil

Scorpil Nov 22, 2017

Contributor

This was a typo, right? There are couple more places where second double column is missing:

pandas/core/frame.py
4516:            .. versionadded: 0.18.0
4679:            .. versionadded: 0.16.1

pandas/core/generic.py
968:            .. versionadded: 0.21.0

pandas/core/series.py
1629:            .. versionadded: 0.19.0
2216:            .. versionadded: 0.18.0

pandas/core/tools/datetimes.py
117:        .. versionadded: 0.18.1
143:        .. versionadded: 0.16.1
181:        .. versionadded: 0.20.0
187:        .. versionadded: 0.22.0

pandas/tseries/offsets.py
778:    .. versionadded: 0.16.1
882:    .. versionadded: 0.18.1

This comment has been minimized.

@jreback

jreback Nov 22, 2017

Contributor

hmm yes looks that way. would be great if you can update those! (if you really want to could also add a lint rule to search for these and fail the build if they are found) (also in doc dir too).

This comment has been minimized.

@Scorpil

Scorpil Nov 22, 2017

Contributor

Separate PR or this will do?

This comment has been minimized.

@jreback

jreback Nov 22, 2017

Contributor

separate PR prob better. (the one you changed already is fine). I think we DO want to add some more generic checks for these formatting tags, I guess sphinx doesn't complain

This comment has been minimized.

@Scorpil

Scorpil Nov 22, 2017

Contributor

Yeah, it's just comments for sphinx. I'll create an issue then, and see what I can do when I have time to look into it. Or somebody will pick it up before that, which is also fine :)

This comment has been minimized.

@Scorpil
# Series avoids inconsistent NaN handling
codes, levels = _factorize_from_iterable(Series(data))
if dtype is None:
dtype = np.uint8
else:

This comment has been minimized.

@jreback

jreback Nov 22, 2017

Contributor
if dtype is None
    dtype = np.uint8
dtype = np.dtype(dtype)

a bit more idiomatic

@jreback

small comments. ping on green.

@jreback jreback added this to the 0.22.0 milestone Nov 22, 2017

@Scorpil

This comment has been minimized.

Contributor

Scorpil commented Nov 22, 2017

@jreback it's green.

@jreback jreback merged commit fedc503 into pandas-dev:master Nov 22, 2017

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jreback

This comment has been minimized.

Contributor

jreback commented Nov 22, 2017

thanks @Scorpil nice patch! keem em coming!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment