
BUG: Fix make_sparse mask generation #17574

Merged
merged 12 commits into from
Sep 28, 2017

Conversation

Licht-T
Contributor

@Licht-T Licht-T commented Sep 18, 2017

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

This is part of #17386.
The mask generation in make_sparse treated two values with the same bit pattern as equal when the array is dtype=object.
This bug made SparseArray([False, 0, 100.0, 0.0], fill_value=False, dtype=np.object) evaluate to [False, False, 100.0, False].
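For context, the root cause can be reproduced with a plain elementwise comparison on an object-dtype array (a minimal sketch; the make_sparse internals are simplified away):

```python
import numpy as np

# Elementwise != on an object-dtype array falls back to Python
# equality, where False == 0 == 0.0, so distinct values collapse.
arr = np.array([False, 0, 100.0, 0.0], dtype=object)
fill_value = False

buggy_mask = arr != fill_value
print(buggy_mask.tolist())  # [False, False, True, False]

# Checking the type as well as the value keeps 0 and 0.0 out of
# the fill positions.
fixed_mask = np.array([type(x) is not type(fill_value) or x != fill_value
                       for x in arr])
print(fixed_mask.tolist())  # [False, True, True, True]
```

With the buggy mask, both 0 and 0.0 are silently treated as the fill value False.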

@Licht-T Licht-T changed the title BUG: Fix make_spase mask generation BUG: Fix make_sprase mask generation Sep 18, 2017
@Licht-T Licht-T changed the title BUG: Fix make_sprase mask generation BUG: Fix make_sparse mask generation Sep 18, 2017
@Licht-T Licht-T force-pushed the fix-make_spase-mask-generation branch from e4d23c0 to 501ba6f Compare September 18, 2017 15:58
@Licht-T Licht-T force-pushed the fix-make_spase-mask-generation branch from 501ba6f to b64c123 Compare September 18, 2017 15:59
@gfyoung gfyoung added Bug Sparse Sparse Data Type labels Sep 18, 2017
Contributor

@jreback jreback left a comment


add a whatsnew entry for this; use the PR number as the issue

            mask.append(e != fill_value)
        else:
            mask.append(True)
    mask = np.array(mask)
Contributor


this is very awkward

you can do this by starting with an array of True,
then using .flat to set only the portion you need, and do that with a list comprehension
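A sketch of the suggested shape, assuming the same type-and-value check as above (variable names illustrative, not the PR's final code):

```python
import numpy as np

arr = np.array([False, 0, 100.0, 0.0], dtype=object)
fill_value = False

# Start from an all-True mask, then assign through .flat with a
# list comprehension that clears entries matching fill_value in
# both type and value.
mask = np.ones(len(arr), dtype=bool)
mask.flat = [type(x) is not type(fill_value) or x != fill_value
             for x in arr]
print(mask.tolist())  # [False, True, True, True]
```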

@@ -61,6 +61,12 @@ def test_constructor_object_dtype(self):
        assert arr.dtype == np.object
        assert arr.fill_value == 'A'

        data = [False, 0, 100.0, 0.0]
Contributor


add the issues number as a comment

@pep8speaks

pep8speaks commented Sep 22, 2017

Hello @Licht-T! Thanks for updating the PR.

Line 31:80: E501 line too long (119 > 79 characters)
Line 32:80: E501 line too long (117 > 79 characters)
Line 34:80: E501 line too long (128 > 79 characters)
Line 35:80: E501 line too long (126 > 79 characters)
Line 37:80: E501 line too long (123 > 79 characters)
Line 38:80: E501 line too long (121 > 79 characters)
Line 40:80: E501 line too long (122 > 79 characters)
Line 41:80: E501 line too long (120 > 79 characters)
Line 80:80: E501 line too long (80 > 79 characters)

Comment last updated on September 27, 2017 at 11:39 Hours UTC

@codecov

codecov bot commented Sep 22, 2017

Codecov Report

Merging #17574 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17574      +/-   ##
==========================================
- Coverage   91.22%    91.2%   -0.02%     
==========================================
  Files         163      163              
  Lines       49625    49632       +7     
==========================================
- Hits        45270    45268       -2     
- Misses       4355     4364       +9
Flag Coverage Δ
#multiple 88.99% <100%> (ø) ⬆️
#single 40.19% <25%> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/core/sparse/array.py 91.44% <100%> (+0.14%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.77% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 37e23d0...05bbd54. Read the comment docs.

@codecov

codecov bot commented Sep 22, 2017

Codecov Report

Merging #17574 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17574      +/-   ##
==========================================
+ Coverage   91.24%   91.25%   +<.01%     
==========================================
  Files         163      163              
  Lines       49762    49764       +2     
==========================================
+ Hits        45406    45412       +6     
+ Misses       4356     4352       -4
Flag Coverage Δ
#multiple 89.05% <100%> (+0.02%) ⬆️
#single 40.34% <66.66%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/sparse/array.py 91.58% <100%> (+0.28%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.73% <0%> (-0.1%) ⬇️
pandas/plotting/_converter.py 65.2% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 44747c8...2c6bca4. Read the comment docs.

@Licht-T Licht-T force-pushed the fix-make_spase-mask-generation branch from 05bbd54 to 7727c5a Compare September 22, 2017 02:48
@Licht-T
Contributor Author

Licht-T commented Sep 22, 2017

@jreback Thanks for your review.

  • Fixed the mask generation method
  • Fixed the insufficient array comparison method in test case

mask = np.ones(len(arr), dtype=np.bool)
fv_type = type(fill_value)

itr = (type(x) is fv_type for x in arr)
Contributor


this is still quite cumbersome; can you simplify?

@Licht-T
Contributor Author

Licht-T commented Sep 22, 2017

@jreback The method is now simplified.

@@ -789,7 +790,15 @@ def make_sparse(arr, kind='block', fill_value=None):
    if is_string_dtype(arr):
        arr = arr.astype(object)

    mask = arr != fill_value
    if is_object_dtype(arr.dtype):
        mask = np.ones(len(arr), dtype=np.bool)
Contributor


ok pls add a comment on what/why this is happening.

no general numpy routine to do this? (I don't think we have a pandas one), but have a look around.

Contributor Author


@jreback Added comments. I haven't found one yet; I think there is no general numpy method for this.

Contributor

@jreback jreback left a comment


can you add some asvs for sparse? (if you prefer another issue that's ok too). We need to benchmark things like this.

@Licht-T
Contributor Author

Licht-T commented Sep 24, 2017

@jreback Added asvs for the SparseArray constructor, but they do not run due to a bug in asv_bench/benchmarks/sparse.py.
I created another PR: #17659.

@jreback
Contributor

jreback commented Sep 24, 2017

what do benchmarks show?

@jreback jreback added this to the 0.21.0 milestone Sep 25, 2017
@Licht-T Licht-T force-pushed the fix-make_spase-mask-generation branch from 08f5a22 to a48f957 Compare September 26, 2017 12:54
@Licht-T
Contributor Author

Licht-T commented Sep 26, 2017

@jreback Here are the benchmark results.
This is quite slow. Should I reimplement it in Cython?

[asv_bench] asv continuous -f 1.1 pandas-dev/master HEAD -b sparse.sparse_array_constructor
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
· Running 12 total benchmarks (2 commits * 1 environments * 6 benchmarks)
[  0.00%] · For pandas commit hash a48f9572:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  8.33%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_10percent                                                            39.3±2ms
[ 16.67%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_1percent                                                             7.67±1ms
[ 25.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_10percent                                                            39.6±0.4ms
[ 33.33%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_1percent                                                             6.69±0.3ms
[ 41.67%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_10percent                                                             556±20ms
[ 50.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_1percent                                                               486±6ms
[ 50.00%] · For pandas commit hash 5279a172:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...................................................................
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 58.33%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_10percent                                                            37.3±3ms
[ 66.67%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_1percent                                                           6.70±0.3ms
[ 75.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_10percent                                                              38.0±1ms
[ 83.33%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_1percent                                                             6.16±0.1ms
[ 91.67%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_10percent                                                             53.8±2ms
[100.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_1percent                                                            32.4±0.5ms
       before           after         ratio
     [5279a172]       [a48f9572]
+      32.4±0.5ms          486±6ms    15.01  sparse.sparse_array_constructor.time_sparse_array_constructor_object_1percent
+        53.8±2ms         556±20ms    10.34  sparse.sparse_array_constructor.time_sparse_array_constructor_object_10percent

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jreback
Contributor

jreback commented Sep 26, 2017

@Licht-T yes, that is slow. Sure, if you can do that in Cython it would be great. It's only a single case, so it shouldn't be much code. Not sure how much speedup you'll get, though, as you can't really type anything.
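As a pure-Python sketch of what such a Cython helper could look like (the function name is hypothetical; a real Cython version would win mainly by avoiding interpreter overhead in the loop, since object elements can't be typed):

```python
import numpy as np

def make_mask_object(arr, fill_value):
    # One pass over the object array: an element counts as fill only
    # when it matches fill_value in both value and exact type.
    n = len(arr)
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        value = arr[i]
        if value == fill_value and type(value) is type(fill_value):
            mask[i] = False
    return mask

arr = np.array([False, 0, 100.0, 0.0], dtype=object)
print(make_mask_object(arr, False).tolist())  # [False, True, True, True]
```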

@jreback
Contributor

jreback commented Sep 26, 2017

also, would actually be ok with NOT allowing a non-NaN fill_value for object dtypes as well. Slightly less flexible, but using sparse with object dtypes is generally not very performant anyhow.

@Licht-T
Contributor Author

Licht-T commented Sep 26, 2017

@jreback Here is the Cython result.

[asv_bench] asv continuous -f 1.1 pandas-dev/master HEAD -b sparse.sparse_array_constructor
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
· Running 12 total benchmarks (2 commits * 1 environments * 6 benchmarks)
[  0.00%] · For pandas commit hash ae079f69:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..................................................................
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  8.33%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_10percent                                                          36.9±0.6ms
[ 16.67%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_1percent                                                           6.91±0.4ms
[ 25.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_10percent                                                              38.5±2ms
[ 33.33%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_1percent                                                            5.88±0.05ms
[ 41.67%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_10percent                                                             85.9±2ms
[ 50.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_1percent                                                            52.7±0.6ms

@Licht-T
Contributor Author

Licht-T commented Sep 26, 2017

This is a rough reimplementation, only checked to pass the tests. It might get a little faster.

@Licht-T
Contributor Author

Licht-T commented Sep 26, 2017

@jreback Here is the final result. NaN and non-NaN fill values are almost the same speed.

[asv_bench] asv continuous -f 1.1 pandas-dev/master HEAD -b sparse.sparse_array_constructor
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
· Running 16 total benchmarks (2 commits * 1 environments * 8 benchmarks)
[  0.00%] · For pandas commit hash 35890726:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  6.25%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_10percent                                                            40.7±1ms
[ 12.50%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_1percent                                                           7.34±0.1ms
[ 18.75%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_10percent                                                            37.3±0.6ms
[ 25.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_1percent                                                            6.34±0.09ms
[ 31.25%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_nan_fill_value_10percent                                              78.2±1ms
[ 37.50%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_nan_fill_value_1percent                                             44.0±0.6ms
[ 43.75%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_non_nan_fill_value_10percent                                          87.0±2ms
[ 50.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_non_nan_fill_value_1percent                                         51.6±0.5ms
[ 50.00%] · For pandas commit hash 5279a172:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 56.25%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_10percent                                                            37.2±2ms
[ 62.50%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_1percent                                                           7.07±0.3ms
[ 68.75%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_10percent                                                              46.8±3ms
[ 75.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_1percent                                                            6.66±0.05ms
[ 81.25%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_nan_fill_value_10percent                                              82.8±1ms
[ 87.50%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_nan_fill_value_1percent                                               46.9±1ms
[ 93.75%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_non_nan_fill_value_10percent                                          55.0±2ms
[100.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_non_nan_fill_value_1percent                                         34.4±0.9ms
       before           after         ratio
     [5279a172]       [35890726]
+        55.0±2ms         87.0±2ms     1.58  sparse.sparse_array_constructor.time_sparse_array_constructor_object_non_nan_fill_value_10percent
+      34.4±0.9ms       51.6±0.5ms     1.50  sparse.sparse_array_constructor.time_sparse_array_constructor_object_non_nan_fill_value_1percent
-        46.8±3ms       37.3±0.6ms     0.80  sparse.sparse_array_constructor.time_sparse_array_constructor_int64_10percent

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jreback
Contributor

jreback commented Sep 26, 2017

ok this looks good
going to give a final look in a bit

@jreback
Contributor

jreback commented Sep 27, 2017

lgtm ping on green.

@jreback jreback merged commit 45bd471 into pandas-dev:master Sep 28, 2017
@jreback
Contributor

jreback commented Sep 28, 2017

thanks @Licht-T keep em coming!

alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017