BUG: multi-type SparseDataFrame fixes and improvements #13917
codecov-io
commented
Aug 5, 2016
Current coverage is 85.30% (diff: 100%)
@@             master   #13917   diff @@
==========================================
  Files           139      139
  Lines         50157    50157
  Methods           0        0
  Messages          0        0
  Branches          0        0
==========================================
  Hits          42785    42785
  Misses         7372     7372
  Partials          0        0
|
jreback
commented on an outdated diff
Aug 5, 2016
| @@ -4440,7 +4440,10 @@ def _lcd_dtype(l): | ||
| """ find the lowest dtype that can accomodate the given types """ | ||
| m = l[0].dtype | ||
| for x in l[1:]: | ||
| - if x.dtype.itemsize > m.itemsize: | ||
| + # the new dtype must either be wider or a strict subtype | ||
| + if (x.dtype.itemsize > m.itemsize or |
jreback
Contributor
|
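For readers following the hunk above: a comparison based on item size alone cannot distinguish two dtypes of equal width. The following is a minimal sketch of that idea, not the pandas function itself. `FakeDtype` and `widest_by_itemsize` are hypothetical names introduced here, and the loop is only modelled on the visible lines of `_lcd_dtype`.

```python
from collections import namedtuple

# Hypothetical stand-in for a NumPy dtype: only the fields this sketch needs.
FakeDtype = namedtuple("FakeDtype", ["name", "itemsize"])

INT64 = FakeDtype("int64", 8)
UINT64 = FakeDtype("uint64", 8)
INT16 = FakeDtype("int16", 2)


def widest_by_itemsize(dtypes):
    """Pick the dtype with the largest itemsize; ties keep the earlier one."""
    chosen = dtypes[0]
    for candidate in dtypes[1:]:
        if candidate.itemsize > chosen.itemsize:
            chosen = candidate
    return chosen


# Different widths: the wider dtype wins regardless of order.
print(widest_by_itemsize([INT16, INT64]).name)
print(widest_by_itemsize([INT64, INT16]).name)

# Equal widths: the result depends only on argument order, even though
# neither int64 nor uint64 can hold every value of the other.
print(widest_by_itemsize([INT64, UINT64]).name)
print(widest_by_itemsize([UINT64, INT64]).name)
```

In this toy model the last two calls disagree with each other, which is the kind of ambiguity a width-only rule leaves open.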
jreback
added Indexing Dtypes
labels
Aug 5, 2016
|
as sparse mainly supports |
sstanovnik
added some commits
Aug 5, 2016
sstanovnik
changed the title from
BUG: slicing single rows of multi-type SparseDataFrames. to BUG: multi-type SparseDataFrame fixes and improvements
Aug 8, 2016
|
I used numpy's find_common_type instead of that local function, this changed a test ( Another (maybe non-minor) change is the default parameter in I also added some new tests as suggested, but don't feel confident in adding the proposed pandas alternative to the numpy find_common_type - my knowledge of pandas dtypes isn't really great. It would likely be similar to |
jreback
commented on the diff
Aug 8, 2016
| @@ -4435,14 +4435,6 @@ def _interleaved_dtype(blocks): | ||
| for x in blocks: | ||
| counts[type(x)].append(x) | ||
| - def _lcd_dtype(l): | ||
| - """ find the lowest dtype that can accomodate the given types """ | ||
| - m = l[0].dtype | ||
| - for x in l[1:]: |
jreback
Contributor
|
|
can u also add tests to check normal current default ( |
sstanovnik
added some commits
Aug 9, 2016
|
It turns out this PR works without the default argument change; I was too hasty in making it. Your PR fixes that better, so I reverted the change. Common type discovery moved to |
|
yep, need to fix this. @sstanovnik can you create a new issue for this.
|
jreback
commented on the diff
Aug 9, 2016
| values = self.mixed_int.as_matrix(['A', 'D']) | ||
| self.assertEqual(values.dtype, np.int64) | ||
| - # guess all ints are cast to uints.... | ||
| + # B uint64 forces float because there are other signed int types |
jreback
Contributor
|
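The new test comment says that a uint64 column forces float when signed integer columns are also present. A rough illustration of why, using plain Python integers rather than NumPy: no 64-bit integer type covers both ranges, and float64 covers them only approximately. The helper names `fits` and `round_trips` are made up for this sketch.

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1
UINT64_MIN, UINT64_MAX = 0, 2**64 - 1


def fits(value, lo, hi):
    """True if value lies inside the closed range [lo, hi]."""
    return lo <= value <= hi


def round_trips(value):
    """True if converting to a Python float (IEEE 754 double) and back is lossless."""
    return int(float(value)) == value


# -1 is representable as int64 but not as uint64 ...
print(fits(-1, INT64_MIN, INT64_MAX), fits(-1, UINT64_MIN, UINT64_MAX))
# ... and the largest uint64 is not representable as int64.
print(fits(UINT64_MAX, INT64_MIN, INT64_MAX), fits(UINT64_MAX, UINT64_MIN, UINT64_MAX))

# A double has a 53-bit significand, so integers beyond 2**53 may be rounded.
print(round_trips(2**53), round_trips(2**53 + 1), round_trips(UINT64_MAX))
```

The trade-off is that the float result can represent the magnitude of every value from both columns, but large integers are no longer exact.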
jreback
commented on an outdated diff
Aug 9, 2016
| @@ -2713,6 +2713,65 @@ def test_type_error_multiindex(self): | ||
| assert_series_equal(result, expected) | ||
| +class TestSparseDataFrameMultitype(tm.TestCase): |
jreback
Contributor
|
jreback
commented on an outdated diff
Aug 9, 2016
jreback
commented on the diff
Aug 9, 2016
| @@ -861,3 +861,9 @@ def _possibly_cast_to_datetime(value, dtype, errors='raise'): | ||
| value = _possibly_infer_to_datetimelike(value) | ||
| return value | ||
| + | ||
| + |
jreback
Contributor
|
sstanovnik
added some commits
Aug 9, 2016
|
Opened issue. Moved the tests, added new tests. Found and processed #10364. |
jreback
commented on an outdated diff
Aug 9, 2016
jreback
commented on an outdated diff
Aug 9, 2016
| @@ -188,6 +191,42 @@ def test_possibly_convert_objects_copy(self): | ||
| self.assertTrue(values is not out) | ||
| +class TestCommonTypes(tm.TestCase): | ||
| + def setUp(self): | ||
| + super(TestCommonTypes, self).setUp() | ||
| + | ||
| + def test_numpy_dtypes(self): | ||
| + self.assertEqual(_find_common_type([np.int64]), np.int64) | ||
| + self.assertEqual(_find_common_type([np.uint64]), np.uint64) | ||
| + self.assertEqual(_find_common_type([np.float32]), np.float32) | ||
| + self.assertEqual(_find_common_type([np.object]), np.object) | ||
| + | ||
| + self.assertEqual(_find_common_type([np.int16, np.int64]), | ||
| + np.int64) |
|
|
sstanovnik
added some commits
Aug 9, 2016
jreback
commented on an outdated diff
Aug 9, 2016
| @@ -437,6 +437,7 @@ API changes | ||
| - ``pd.Timedelta(None)`` is now accepted and will return ``NaT``, mirroring ``pd.Timestamp`` (:issue:`13687`) | ||
| - ``Timestamp``, ``Period``, ``DatetimeIndex``, ``PeriodIndex`` and ``.dt`` accessor have gained a ``.is_leap_year`` property to check whether the date belongs to a leap year. (:issue:`13727`) | ||
| - ``pd.read_hdf`` will now raise a ``ValueError`` instead of ``KeyError``, if a mode other than ``r``, ``r+`` and ``a`` is supplied. (:issue:`13623`) | ||
| +- ``.values`` will now return ``np.float64`` with a ``DataFrame`` with ``np.int64`` and ``np.uint64`` dtypes, conforming to ``np.find_common_type`` (:issue:`10364`, :issue:`13917`) |
jreback
Contributor
|
jreback
commented on an outdated diff
Aug 9, 2016
| @@ -764,6 +765,7 @@ Note that the limitation is applied to ``fill_value`` which default is ``np.nan` | ||
| - Bug in ``SparseDataFrame`` doesn't respect passed ``SparseArray`` or ``SparseSeries`` 's dtype and ``fill_value`` (:issue:`13866`) | ||
| - Bug in ``SparseArray`` and ``SparseSeries`` don't apply ufunc to ``fill_value`` (:issue:`13853`) | ||
| - Bug in ``SparseSeries.abs`` incorrectly keeps negative ``fill_value`` (:issue:`13853`) | ||
| +- Bug when interacting with multi-type SparseDataFrames: single row slicing now works because types are not forced to float (:issue:`13917`) |
|
|
jreback
commented on an outdated diff
Aug 9, 2016
jreback
and 1 other
commented on an outdated diff
Aug 9, 2016
| + self.cols = ['string', 'int', 'float', 'object'] | ||
| + self.sdf = self.sdf[self.cols] | ||
| + | ||
| + def test_basic_dtypes(self): | ||
| + for _, row in self.sdf.iterrows(): | ||
| + self.assertEqual(row.dtype, object) | ||
| + tm.assert_sp_series_equal(self.sdf['string'], self.string_series, | ||
| + check_names=False) | ||
| + tm.assert_sp_series_equal(self.sdf['int'], self.int_series, | ||
| + check_names=False) | ||
| + tm.assert_sp_series_equal(self.sdf['float'], self.float_series, | ||
| + check_names=False) | ||
| + tm.assert_sp_series_equal(self.sdf['object'], self.object_series, | ||
| + check_names=False) | ||
| + | ||
| + def test_indexing_single(self): |
sstanovnik
Contributor
|
jreback
commented on an outdated diff
Aug 9, 2016
| + tm.assert_sp_frame_equal(self.sdf[['int', 'string']], | ||
| + pd.SparseDataFrame({ | ||
| + 'int': self.int_series, | ||
| + 'string': self.string_series, | ||
| + })) | ||
| + | ||
| + | ||
| +class TestSparseSeriesMultitype(tm.TestCase): | ||
| + def setUp(self): | ||
| + super(TestSparseSeriesMultitype, self).setUp() | ||
| + self.index = ['string', 'int', 'float', 'object'] | ||
| + self.ss = pd.SparseSeries(['a', 1, 1.1, []], | ||
| + index=self.index) | ||
| + | ||
| + def test_indexing_single(self): | ||
| + for i, idx in enumerate(self.index): |
|
|
jreback
commented on an outdated diff
Aug 9, 2016
| + # identity | ||
| + self.assertEqual(_find_common_type([np.int64]), np.int64) | ||
| + self.assertEqual(_find_common_type([np.uint64]), np.uint64) | ||
| + self.assertEqual(_find_common_type([np.float32]), np.float32) | ||
| + self.assertEqual(_find_common_type([np.object]), np.object) | ||
| + | ||
| + # into ints | ||
| + self.assertEqual(_find_common_type([np.int16, np.int64]), | ||
| + np.int64) | ||
| + self.assertEqual(_find_common_type([np.int32, np.uint32]), | ||
| + np.int64) | ||
| + self.assertEqual(_find_common_type([np.uint16, np.uint64]), | ||
| + np.uint64) | ||
| + | ||
| + # into floats | ||
| + self.assertEqual(_find_common_type([np.float16, np.float32]), |
jreback
Contributor
|
jreback
commented on an outdated diff
Aug 9, 2016
| + np.float64) | ||
| + self.assertEqual(_find_common_type([np.int16, np.float64]), | ||
| + np.float64) | ||
| + self.assertEqual(_find_common_type([np.float16, np.int64]), | ||
| + np.float64) | ||
| + | ||
| + # into others | ||
| + self.assertEqual(_find_common_type([np.complex128, np.int32]), | ||
| + np.complex128) | ||
| + self.assertEqual(_find_common_type([np.object, np.float32]), | ||
| + np.object) | ||
| + self.assertEqual(_find_common_type([np.object, np.int16]), | ||
| + np.object) | ||
| + | ||
| + def test_pandas_dtypes(self): | ||
| + with self.assertRaises(TypeError): |
|
|
jreback
and 1 other
commented on an outdated diff
Aug 9, 2016
| @@ -764,6 +765,7 @@ Note that the limitation is applied to ``fill_value`` which default is ``np.nan` | ||
| - Bug in ``SparseDataFrame`` doesn't respect passed ``SparseArray`` or ``SparseSeries`` 's dtype and ``fill_value`` (:issue:`13866`) | ||
| - Bug in ``SparseArray`` and ``SparseSeries`` don't apply ufunc to ``fill_value`` (:issue:`13853`) | ||
| - Bug in ``SparseSeries.abs`` incorrectly keeps negative ``fill_value`` (:issue:`13853`) | ||
| +- Bug in single row slicing on multi-type ``SparseDataFrame``s: types were previously forced to float (:issue:`13917`) |
|
|
jreback
commented on the diff
Aug 9, 2016
| + ((np.uint16, np.uint64), np.uint64), | ||
| + | ||
| + # into floats | ||
| + ((np.float16, np.float32), np.float32), | ||
| + ((np.float16, np.int16), np.float32), | ||
| + ((np.float32, np.int16), np.float32), | ||
| + ((np.uint64, np.int64), np.float64), | ||
| + ((np.int16, np.float64), np.float64), | ||
| + ((np.float16, np.int64), np.float64), | ||
| + | ||
| + # into others | ||
| + ((np.complex128, np.int32), np.complex128), | ||
| + ((np.object, np.float32), np.object), | ||
| + ((np.object, np.int16), np.object), | ||
| + ) | ||
| + for src, common in testcases: |
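The expectations in the test table above can be reproduced with a small pure-Python model of pairwise promotion. This is only a sketch under stated assumptions: it is neither NumPy's nor pandas' implementation, dtypes are modelled as hypothetical `(kind, bits)` tuples, and `promote_pair` and `find_common` are names invented for this example. It covers only the integer, float, complex and object cases that appear in the table.

```python
from functools import reduce

OBJECT = ("O", 0)

# Float width (bits) assumed sufficient for an integer of the given width,
# capped at 64 bits. int64 -> float64 is lossy but is the widest available.
_FLOAT_BITS_FOR_INT = {8: 16, 16: 32, 32: 64, 64: 64}

# Kind ordering used to decide which operand is "simpler".
_RANK = "uifc"  # unsigned < signed < float < complex


def promote_pair(a, b):
    """Return the common toy dtype of two (kind, bits) tuples."""
    if a == b:
        return a
    if "O" in (a[0], b[0]):
        return OBJECT
    lo, hi = sorted((a, b), key=lambda t: _RANK.index(t[0]))
    if lo[0] == hi[0]:  # same kind: keep the wider one
        return (lo[0], max(lo[1], hi[1]))
    if hi[0] == "i":  # unsigned with signed: need one extra bit
        bits = max(hi[1], 2 * lo[1])
        return ("i", bits) if bits <= 64 else ("f", 64)
    if hi[0] == "f":  # integer with float
        return ("f", max(hi[1], _FLOAT_BITS_FOR_INT[lo[1]]))
    # hi is complex: promote its float component, then double the width.
    component = promote_pair(lo, ("f", hi[1] // 2))
    return ("c", 2 * component[1])


def find_common(dtypes):
    """Fold promote_pair over a non-empty sequence of toy dtypes."""
    return reduce(promote_pair, dtypes)


# A few rows of the table above, rewritten for the toy representation.
cases = [
    ((("i", 32), ("u", 32)), ("i", 64)),
    ((("u", 64), ("i", 64)), ("f", 64)),
    ((("f", 16), ("i", 16)), ("f", 32)),
    ((("c", 128), ("i", 32)), ("c", 128)),
    ((OBJECT, ("f", 32)), OBJECT),
]
for src, common in cases:
    assert find_common(src) == common
```

The `(uint64, int64)` row is the one behind the `.values` change noted in the whatsnew entry: in this model no signed integer wider than 64 bits exists, so the result falls back to a 64-bit float.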
|
|
|
lgtm. @sinhrks ? |
jreback
added this to the
0.19.0
milestone
Aug 9, 2016
|
Thanks for your patience. |
|
ha! thanks for yours |
|
lgtm, thx @sstanovnik ! |
jreback
closed this
in 0e7ae89
Aug 10, 2016
|
thanks! |
sstanovnik commented Aug 5, 2016
git diff upstream/master | flake8 --diff
Types were incorrectly determined when slicing SparseDataFrames with multiple dtypes (such as float and object) into SparseSeries.
No existing issue covers this.