BUG: multi-type SparseDataFrame fixes and improvements #13917

Closed
wants to merge 15 commits into
from

Conversation

Projects
None yet
4 participants
Contributor

sstanovnik commented Aug 5, 2016 edited

  • 1 tests added / passed
  • passes git diff upstream/master | flake8 --diff
  • whatsnew entry

Types were incorrectly determined when slicing SparseDataFrames with
multiple dtypes (such as float and object) into SparseSeries.

import pandas as pd
sdf = pd.SparseDataFrame([[1, 2, 'a'], [4, 5, 'b']])
sdf.iloc[[0]]  # OK
sdf.iloc[0]  # ValueError: could not convert string to float: 'a'

No existing issue covers this.

codecov-io commented Aug 5, 2016 edited

Current coverage is 85.30% (diff: 100%)

Merging #13917 into master will not change coverage

@@             master     #13917   diff @@
==========================================
  Files           139        139          
  Lines         50157      50157          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          42785      42785          
  Misses         7372       7372          
  Partials          0          0          

Powered by Codecov. Last update b7abef4...8c7d1ea

@jreback jreback commented on an outdated diff Aug 5, 2016

pandas/core/internals.py
@@ -4440,7 +4440,10 @@ def _lcd_dtype(l):
""" find the lowest dtype that can accomodate the given types """
m = l[0].dtype
for x in l[1:]:
- if x.dtype.itemsize > m.itemsize:
+ # the new dtype must either be wider or a strict subtype
+ if (x.dtype.itemsize > m.itemsize or
@jreback

jreback Aug 5, 2016

Contributor

doesn't np.find_common_type do this? (or .common_type)?

should create a pandas version of this to isolate (e.g. this won't handle non-numpy types)

Contributor

jreback commented Aug 5, 2016

Member

sinhrks commented Aug 5, 2016

as sparse mainly supports float64, adding more tests for mixed dtype is appreciated (maybe in the same file as separate class).

sstanovnik changed the title from BUG: slicing single rows of multi-type SparseDataFrames. to BUG: multi-type SparseDataFrame fixes and improvements Aug 8, 2016

Contributor

sstanovnik commented Aug 8, 2016

I used numpy's find_common_type instead of that local function, this changed a test (test_block_internals.py), see if that numpy behaviour is desired.

Another (maybe non-minor) change is the default parameter in sparse/array.py, but that didn't seem to really break anything.

I also added some new tests as suggested, but don't feel confident in adding the proposed pandas alternative to the numpy find_common_type - my knowledge of pandas dtypes isn't really great. It would likely be similar to _interleaved_dtype in core/internals.py right?

@jreback jreback commented on the diff Aug 8, 2016

pandas/core/internals.py
@@ -4435,14 +4435,6 @@ def _interleaved_dtype(blocks):
for x in blocks:
counts[type(x)].append(x)
- def _lcd_dtype(l):
- """ find the lowest dtype that can accomodate the given types """
- m = l[0].dtype
- for x in l[1:]:
@jreback

jreback Aug 8, 2016

Contributor

what I meant was can start off by simply moving this (your version or new one with np.find_common_type to pandas.types.cast (and if we have specific tests move similarly; if not, ideally add some). we can add the pandas specific functionaility later.

Member

sinhrks commented Aug 8, 2016

can u also add tests to check normal SparseArray and its dtypes (related to the default change)?

current default (float64) is because it doesn't fully support int and bool, should be solved by #13849.

sstanovnik added some commits Aug 9, 2016

@sstanovnik sstanovnik Revert default argument change. 6782bc7
@sstanovnik sstanovnik Factor the common type discovery to an internal function.
2104948
Contributor

sstanovnik commented Aug 9, 2016

It turns out this PR works without the default argument change, I was too hasty to change it. Your PR fixes that better, so I reverted the change.

Common type discovery moved to types/cast.py:_find_common_type. I did notice, however, that types/common.py:_lcd_dtypes exists (seems to only work with numpy dtypes, but differently than np.find_common_type), but changing either implementation to the other broke some things, so I left it as it is.

Contributor

jreback commented Aug 9, 2016

yep, need to fix this.

@sstanovnik can you create a new issue for this.

[jreback-~/pandas] grin _lcd_dtype pandas
pandas/core/frame.py:
   47 :                                  _lcd_dtypes,
 3705 :                 new_dtype = _lcd_dtypes(this_dtype, other_dtype)
pandas/core/internals.py:
 4438 :     def _lcd_dtype(l):
 4473 :         lcd = _lcd_dtype(counts[IntBlock])
 4489 :         return _lcd_dtype(counts[FloatBlock] + counts[SparseBlock])
pandas/types/common.py:
  389 : def _lcd_dtypes(a_dtype, b_dtype):

@jreback jreback commented on the diff Aug 9, 2016

pandas/tests/frame/test_block_internals.py
values = self.mixed_int.as_matrix(['A', 'D'])
self.assertEqual(values.dtype, np.int64)
- # guess all ints are cast to uints....
+ # B uint64 forces float because there are other signed int types
@jreback

jreback Aug 9, 2016

Contributor

this might fix another bug, can you search for uint64 issues and see?

@jreback

jreback Aug 9, 2016

Contributor

add the issue as a reference here
and in the whatsnew

@jreback jreback commented on an outdated diff Aug 9, 2016

pandas/tests/frame/test_indexing.py
@@ -2713,6 +2713,65 @@ def test_type_error_multiindex(self):
assert_series_equal(result, expected)
+class TestSparseDataFrameMultitype(tm.TestCase):
@jreback

jreback Aug 9, 2016

Contributor

these need to go in pandas/spare/tests (maybe new file inside there if nothing) appropriate)

@jreback jreback commented on an outdated diff Aug 9, 2016

pandas/tests/series/test_indexing.py
@@ -1856,3 +1856,29 @@ def test_multilevel_preserve_name(self):
result2 = s.ix['foo']
self.assertEqual(result.name, s.name)
self.assertEqual(result2.name, s.name)
+
+
@jreback

jreback Aug 9, 2016

Contributor

same here

@jreback jreback commented on the diff Aug 9, 2016

pandas/types/cast.py
@@ -861,3 +861,9 @@ def _possibly_cast_to_datetime(value, dtype, errors='raise'):
value = _possibly_infer_to_datetimelike(value)
return value
+
+
@jreback

jreback Aug 9, 2016

Contributor

can you add some tests for validation on this (there might be some existing)

@jreback

jreback Aug 9, 2016

Contributor

IOW, just run it thru the standard types for now, and asserting that it raises for pandas dtypes

sstanovnik referenced this pull request Aug 9, 2016

Closed

Unified common dtype discovery #13947

0 of 2 tasks complete
Contributor

sstanovnik commented Aug 9, 2016

Opened issue. Moved the tests, added new tests. Found and processed #10364.

@jreback jreback commented on an outdated diff Aug 9, 2016

pandas/tests/types/test_cast.py
@@ -188,6 +191,42 @@ def test_possibly_convert_objects_copy(self):
self.assertTrue(values is not out)
+class TestCommonTypes(tm.TestCase):
+ def setUp(self):
+ super(TestCommonTypes, self).setUp()
@jreback

jreback Aug 9, 2016

Contributor

you don't need setUp here

@jreback jreback commented on an outdated diff Aug 9, 2016

pandas/tests/types/test_cast.py
@@ -188,6 +191,42 @@ def test_possibly_convert_objects_copy(self):
self.assertTrue(values is not out)
+class TestCommonTypes(tm.TestCase):
+ def setUp(self):
+ super(TestCommonTypes, self).setUp()
+
+ def test_numpy_dtypes(self):
+ self.assertEqual(_find_common_type([np.int64]), np.int64)
+ self.assertEqual(_find_common_type([np.uint64]), np.uint64)
+ self.assertEqual(_find_common_type([np.float32]), np.float32)
+ self.assertEqual(_find_common_type([np.object]), np.object)
+
+ self.assertEqual(_find_common_type([np.int16, np.int64]),
+ np.int64)
@jreback

jreback Aug 9, 2016

Contributor

group them

eg ints

floats

easier to read

@jreback jreback commented on an outdated diff Aug 9, 2016

doc/source/whatsnew/v0.19.0.txt
@@ -437,6 +437,7 @@ API changes
- ``pd.Timedelta(None)`` is now accepted and will return ``NaT``, mirroring ``pd.Timestamp`` (:issue:`13687`)
- ``Timestamp``, ``Period``, ``DatetimeIndex``, ``PeriodIndex`` and ``.dt`` accessor have gained a ``.is_leap_year`` property to check whether the date belongs to a leap year. (:issue:`13727`)
- ``pd.read_hdf`` will now raise a ``ValueError`` instead of ``KeyError``, if a mode other than ``r``, ``r+`` and ``a`` is supplied. (:issue:`13623`)
+- ``.values`` will now return ``np.float64`` with a ``DataFrame`` with ``np.int64`` and ``np.uint64`` dtypes, conforming to ``np.find_common_type`` (:issue:`10364`, :issue:`13917`)
@jreback

jreback Aug 9, 2016

Contributor

DataFrame.values will now ....with a frame of mixed int64 and uint64 dtypes.....

@jreback jreback commented on an outdated diff Aug 9, 2016

doc/source/whatsnew/v0.19.0.txt
@@ -764,6 +765,7 @@ Note that the limitation is applied to ``fill_value`` which default is ``np.nan`
- Bug in ``SparseDataFrame`` doesn't respect passed ``SparseArray`` or ``SparseSeries`` 's dtype and ``fill_value`` (:issue:`13866`)
- Bug in ``SparseArray`` and ``SparseSeries`` don't apply ufunc to ``fill_value`` (:issue:`13853`)
- Bug in ``SparseSeries.abs`` incorrectly keeps negative ``fill_value`` (:issue:`13853`)
+- Bug when interacting with multi-type SparseDataFrames: single row slicing now works because types are not forced to float (:issue:`13917`)
@jreback

jreback Aug 9, 2016

Contributor

Bug in single row slicing on multi-dtype SparseDataFrame s.....

@jreback jreback commented on an outdated diff Aug 9, 2016

pandas/sparse/tests/test_multitype.py
@@ -0,0 +1,88 @@
+import numpy as np
+import pandas as pd
+import pandas.util.testing as tm
+
+
+class TestSparseDataFrameMultitype(tm.TestCase):
+ def setUp(self):
+ super(TestSparseDataFrameMultitype, self).setUp()
@jreback

jreback Aug 9, 2016

Contributor

you don't need the super

@jreback jreback and 1 other commented on an outdated diff Aug 9, 2016

pandas/sparse/tests/test_multitype.py
+ self.cols = ['string', 'int', 'float', 'object']
+ self.sdf = self.sdf[self.cols]
+
+ def test_basic_dtypes(self):
+ for _, row in self.sdf.iterrows():
+ self.assertEqual(row.dtype, object)
+ tm.assert_sp_series_equal(self.sdf['string'], self.string_series,
+ check_names=False)
+ tm.assert_sp_series_equal(self.sdf['int'], self.int_series,
+ check_names=False)
+ tm.assert_sp_series_equal(self.sdf['float'], self.float_series,
+ check_names=False)
+ tm.assert_sp_series_equal(self.sdf['object'], self.object_series,
+ check_names=False)
+
+ def test_indexing_single(self):
@jreback

jreback Aug 9, 2016

Contributor

I think we have a test_indexing.py in sparse

@sstanovnik

sstanovnik Aug 9, 2016

Contributor

Make a TestMultitype class in indexing.py instead of having a separate file?

@jreback

jreback Aug 9, 2016

Contributor

yes.

@jreback jreback commented on an outdated diff Aug 9, 2016

pandas/sparse/tests/test_multitype.py
+ tm.assert_sp_frame_equal(self.sdf[['int', 'string']],
+ pd.SparseDataFrame({
+ 'int': self.int_series,
+ 'string': self.string_series,
+ }))
+
+
+class TestSparseSeriesMultitype(tm.TestCase):
+ def setUp(self):
+ super(TestSparseSeriesMultitype, self).setUp()
+ self.index = ['string', 'int', 'float', 'object']
+ self.ss = pd.SparseSeries(['a', 1, 1.1, []],
+ index=self.index)
+
+ def test_indexing_single(self):
+ for i, idx in enumerate(self.index):
@jreback

jreback Aug 9, 2016

Contributor

same here

@jreback jreback commented on an outdated diff Aug 9, 2016

pandas/tests/types/test_cast.py
+ # identity
+ self.assertEqual(_find_common_type([np.int64]), np.int64)
+ self.assertEqual(_find_common_type([np.uint64]), np.uint64)
+ self.assertEqual(_find_common_type([np.float32]), np.float32)
+ self.assertEqual(_find_common_type([np.object]), np.object)
+
+ # into ints
+ self.assertEqual(_find_common_type([np.int16, np.int64]),
+ np.int64)
+ self.assertEqual(_find_common_type([np.int32, np.uint32]),
+ np.int64)
+ self.assertEqual(_find_common_type([np.uint16, np.uint64]),
+ np.uint64)
+
+ # into floats
+ self.assertEqual(_find_common_type([np.float16, np.float32]),
@jreback

jreback Aug 9, 2016

Contributor

make this shorter e.g.

for types in [(np.float16, np.int16), .....]:
     self.assertEqual(_find_common_type(types, np.float32)

and similarly as much as possible

@jreback jreback commented on an outdated diff Aug 9, 2016

pandas/tests/types/test_cast.py
+ np.float64)
+ self.assertEqual(_find_common_type([np.int16, np.float64]),
+ np.float64)
+ self.assertEqual(_find_common_type([np.float16, np.int64]),
+ np.float64)
+
+ # into others
+ self.assertEqual(_find_common_type([np.complex128, np.int32]),
+ np.complex128)
+ self.assertEqual(_find_common_type([np.object, np.float32]),
+ np.object)
+ self.assertEqual(_find_common_type([np.object, np.int16]),
+ np.object)
+
+ def test_pandas_dtypes(self):
+ with self.assertRaises(TypeError):
@jreback

jreback Aug 9, 2016

Contributor

put a TODO: this is not implemented

@jreback jreback and 1 other commented on an outdated diff Aug 9, 2016

doc/source/whatsnew/v0.19.0.txt
@@ -764,6 +765,7 @@ Note that the limitation is applied to ``fill_value`` which default is ``np.nan`
- Bug in ``SparseDataFrame`` doesn't respect passed ``SparseArray`` or ``SparseSeries`` 's dtype and ``fill_value`` (:issue:`13866`)
- Bug in ``SparseArray`` and ``SparseSeries`` don't apply ufunc to ``fill_value`` (:issue:`13853`)
- Bug in ``SparseSeries.abs`` incorrectly keeps negative ``fill_value`` (:issue:`13853`)
+- Bug in single row slicing on multi-type ``SparseDataFrame``s: types were previously forced to float (:issue:`13917`)
@jreback

jreback Aug 9, 2016

Contributor

, types where previously...

@sstanovnik

sstanovnik Aug 9, 2016

Contributor

This reads fine to me? There's a colon after SparseDataFrames

@jreback

jreback Aug 9, 2016

Contributor

change : to ,

@sstanovnik

sstanovnik Aug 9, 2016

Contributor

Oh wait sorry I misread your comment.

@jreback jreback commented on the diff Aug 9, 2016

pandas/tests/types/test_cast.py
+ ((np.uint16, np.uint64), np.uint64),
+
+ # into floats
+ ((np.float16, np.float32), np.float32),
+ ((np.float16, np.int16), np.float32),
+ ((np.float32, np.int16), np.float32),
+ ((np.uint64, np.int64), np.float64),
+ ((np.int16, np.float64), np.float64),
+ ((np.float16, np.int64), np.float64),
+
+ # into others
+ ((np.complex128, np.int32), np.complex128),
+ ((np.object, np.float32), np.object),
+ ((np.object, np.int16), np.object),
+ )
+ for src, common in testcases:
@jreback

jreback Aug 9, 2016

Contributor

nice!

Contributor

jreback commented Aug 9, 2016

lgtm. @sinhrks ?

jreback added this to the 0.19.0 milestone Aug 9, 2016

@sstanovnik sstanovnik Colon to comma.
8c7d1ea
Contributor

sstanovnik commented Aug 9, 2016

Thanks for your patience.

Contributor

jreback commented Aug 9, 2016

ha! thanks for yours

Member

sinhrks commented Aug 9, 2016

lgtm, thx @sstanovnik ! find_common_type is needed for #13849 also :)

jreback closed this in 0e7ae89 Aug 10, 2016

Contributor

jreback commented Aug 10, 2016

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment