
BUG: Fix make_sparse mask generation #17574

Merged
merged 12 commits into from
Sep 28, 2017

Conversation

Licht-T
Contributor

@Licht-T Licht-T commented Sep 18, 2017

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

This is part of #17386.
The mask generation in make_sparse treated two values with the same bit pattern as equal when the array is dtype=object.
This bug made SparseArray([False, 0, 100.0, 0.0], fill_value=False, dtype=np.object) evaluate to [False, False, 100.0, False].
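For context, the root cause can be reproduced with a plain elementwise comparison on an object-dtype array (a minimal sketch; the make_sparse internals are simplified away):

```python
import numpy as np

# Elementwise != on an object-dtype array falls back to Python
# equality, where False == 0 == 0.0, so distinct values collapse.
arr = np.array([False, 0, 100.0, 0.0], dtype=object)
fill_value = False

buggy_mask = arr != fill_value
print(buggy_mask.tolist())  # [False, False, True, False]

# Checking the type as well as the value keeps 0 and 0.0 out of
# the fill positions.
fixed_mask = np.array([type(x) is not type(fill_value) or x != fill_value
                       for x in arr])
print(fixed_mask.tolist())  # [False, True, True, True]
```

With the buggy mask, both 0 and 0.0 are silently treated as the fill value False.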

@Licht-T Licht-T changed the title BUG: Fix make_spase mask generation BUG: Fix make_sprase mask generation Sep 18, 2017
@Licht-T Licht-T changed the title BUG: Fix make_sprase mask generation BUG: Fix make_sparse mask generation Sep 18, 2017
@Licht-T Licht-T force-pushed the fix-make_spase-mask-generation branch from e4d23c0 to 501ba6f Compare September 18, 2017 15:58
@Licht-T Licht-T force-pushed the fix-make_spase-mask-generation branch from 501ba6f to b64c123 Compare September 18, 2017 15:59
@gfyoung gfyoung added Bug Sparse Sparse Data Type labels Sep 18, 2017
Contributor

@jreback jreback left a comment


add a whatsnew entry for this; use the PR number as the issue

            mask.append(e != fill_value)
        else:
            mask.append(True)
    mask = np.array(mask)
Contributor


this is very awkward

you can do this by starting with an array of True,
then using .flat to set only the portion you need, and do that with a list comprehension
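A sketch of the suggested shape, assuming the same type-and-value check as above (variable names illustrative, not the PR's final code):

```python
import numpy as np

arr = np.array([False, 0, 100.0, 0.0], dtype=object)
fill_value = False

# Start from an all-True mask, then assign through .flat with a
# list comprehension that clears entries matching fill_value in
# both type and value.
mask = np.ones(len(arr), dtype=bool)
mask.flat = [type(x) is not type(fill_value) or x != fill_value
             for x in arr]
print(mask.tolist())  # [False, True, True, True]
```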

@@ -61,6 +61,12 @@ def test_constructor_object_dtype(self):
        assert arr.dtype == np.object
        assert arr.fill_value == 'A'

        data = [False, 0, 100.0, 0.0]
Contributor


add the issues number as a comment

@pep8speaks

pep8speaks commented Sep 22, 2017

Hello @Licht-T! Thanks for updating the PR.

Line 31:80: E501 line too long (119 > 79 characters)
Line 32:80: E501 line too long (117 > 79 characters)
Line 34:80: E501 line too long (128 > 79 characters)
Line 35:80: E501 line too long (126 > 79 characters)
Line 37:80: E501 line too long (123 > 79 characters)
Line 38:80: E501 line too long (121 > 79 characters)
Line 40:80: E501 line too long (122 > 79 characters)
Line 41:80: E501 line too long (120 > 79 characters)
Line 80:80: E501 line too long (80 > 79 characters)

Comment last updated on September 27, 2017 at 11:39 Hours UTC

@codecov

codecov bot commented Sep 22, 2017

Codecov Report

Merging #17574 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17574      +/-   ##
==========================================
- Coverage   91.22%    91.2%   -0.02%     
==========================================
  Files         163      163              
  Lines       49625    49632       +7     
==========================================
- Hits        45270    45268       -2     
- Misses       4355     4364       +9
Flag Coverage Δ
#multiple 88.99% <100%> (ø) ⬆️
#single 40.19% <25%> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/core/sparse/array.py 91.44% <100%> (+0.14%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.77% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 37e23d0...05bbd54. Read the comment docs.

@codecov

codecov bot commented Sep 22, 2017

Codecov Report

Merging #17574 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17574      +/-   ##
==========================================
+ Coverage   91.24%   91.25%   +<.01%     
==========================================
  Files         163      163              
  Lines       49762    49764       +2     
==========================================
+ Hits        45406    45412       +6     
+ Misses       4356     4352       -4
Flag Coverage Δ
#multiple 89.05% <100%> (+0.02%) ⬆️
#single 40.34% <66.66%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/sparse/array.py 91.58% <100%> (+0.28%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.73% <0%> (-0.1%) ⬇️
pandas/plotting/_converter.py 65.2% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 44747c8...2c6bca4. Read the comment docs.

@Licht-T Licht-T force-pushed the fix-make_spase-mask-generation branch from 05bbd54 to 7727c5a Compare September 22, 2017 02:48
@Licht-T
Contributor Author

Licht-T commented Sep 22, 2017

@jreback Thanks for your review.

  • Fixed the mask generation method
  • Fixed the insufficient array comparison method in test case

mask = np.ones(len(arr), dtype=np.bool)
fv_type = type(fill_value)

itr = (type(x) is fv_type for x in arr)
Contributor


this is still quite cumbersome; can you simplify?

@Licht-T
Contributor Author

Licht-T commented Sep 22, 2017

@jreback The method is now simplified.

@@ -789,7 +790,15 @@ def make_sparse(arr, kind='block', fill_value=None):
    if is_string_dtype(arr):
        arr = arr.astype(object)

    mask = arr != fill_value
    if is_object_dtype(arr.dtype):
        mask = np.ones(len(arr), dtype=np.bool)
Contributor


ok pls add a comment on what/why this is happening.

no general numpy routine to do this? (I don't think we have a pandas one), but have a look around.

Contributor Author


@jreback Added comments. I haven't found one yet; I think there is no general numpy method for this.

Contributor

@jreback jreback left a comment


can you add some asvs for sparse? (if you prefer another issue that's ok too). We need to benchmark things like this.

@Licht-T
Contributor Author

Licht-T commented Sep 24, 2017

@jreback Added asvs for the SparseArray constructor, but they do not run due to a bug in asv_bench/benchmarks/sparse.py.
I created another PR: #17659.

@jreback
Contributor

jreback commented Sep 24, 2017

what do benchmarks show?

@jreback jreback added this to the 0.21.0 milestone Sep 25, 2017
@Licht-T Licht-T force-pushed the fix-make_spase-mask-generation branch from 08f5a22 to a48f957 Compare September 26, 2017 12:54
@Licht-T
Contributor Author

Licht-T commented Sep 26, 2017

@jreback Here are the benchmark results.
This is quite slow. Should I reimplement it in Cython?

[asv_bench] asv continuous -f 1.1 pandas-dev/master HEAD -b sparse.sparse_array_constructor
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
· Running 12 total benchmarks (2 commits * 1 environments * 6 benchmarks)
[  0.00%] · For pandas commit hash a48f9572:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  8.33%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_10percent                                                            39.3±2ms
[ 16.67%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_1percent                                                             7.67±1ms
[ 25.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_10percent                                                            39.6±0.4ms
[ 33.33%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_1percent                                                             6.69±0.3ms
[ 41.67%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_10percent                                                             556±20ms
[ 50.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_1percent                                                               486±6ms
[ 50.00%] · For pandas commit hash 5279a172:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...................................................................
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 58.33%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_10percent                                                            37.3±3ms
[ 66.67%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_1percent                                                           6.70±0.3ms
[ 75.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_10percent                                                              38.0±1ms
[ 83.33%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_1percent                                                             6.16±0.1ms
[ 91.67%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_10percent                                                             53.8±2ms
[100.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_1percent                                                            32.4±0.5ms
       before           after         ratio
     [5279a172]       [a48f9572]
+      32.4±0.5ms          486±6ms    15.01  sparse.sparse_array_constructor.time_sparse_array_constructor_object_1percent
+        53.8±2ms         556±20ms    10.34  sparse.sparse_array_constructor.time_sparse_array_constructor_object_10percent

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jreback
Contributor

jreback commented Sep 26, 2017

@Licht-T yes, that is slow. Sure, if you can do that in Cython it would be great. It's only a single case, so it shouldn't be much code. Not sure how much speedup you'll get, though, as you can't really type anything.
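As a pure-Python sketch of what such a Cython helper could look like (the function name is hypothetical; a real Cython version would win mainly by avoiding interpreter overhead in the loop, since object elements can't be typed):

```python
import numpy as np

def make_mask_object(arr, fill_value):
    # One pass over the object array: an element counts as fill only
    # when it matches fill_value in both value and exact type.
    n = len(arr)
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        value = arr[i]
        if value == fill_value and type(value) is type(fill_value):
            mask[i] = False
    return mask

arr = np.array([False, 0, 100.0, 0.0], dtype=object)
print(make_mask_object(arr, False).tolist())  # [False, True, True, True]
```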

@jreback
Contributor

jreback commented Sep 26, 2017

also, would actually be ok with NOT allowing a non-NaN fill_value for object dtypes as well. Slightly less flexible, but using sparse with object dtypes is generally not very performant anyhow.

@Licht-T
Contributor Author

Licht-T commented Sep 26, 2017

@jreback Here is the Cython result.

[asv_bench] asv continuous -f 1.1 pandas-dev/master HEAD -b sparse.sparse_array_constructor
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
· Running 12 total benchmarks (2 commits * 1 environments * 6 benchmarks)
[  0.00%] · For pandas commit hash ae079f69:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..................................................................
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  8.33%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_10percent                                                          36.9±0.6ms
[ 16.67%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_1percent                                                           6.91±0.4ms
[ 25.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_10percent                                                              38.5±2ms
[ 33.33%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_1percent                                                            5.88±0.05ms
[ 41.67%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_10percent                                                             85.9±2ms
[ 50.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_1percent                                                            52.7±0.6ms

@Licht-T
Contributor Author

Licht-T commented Sep 26, 2017

This is a rough reimplementation, only checked to pass the tests. It might get a little faster.

@Licht-T
Contributor Author

Licht-T commented Sep 26, 2017

@jreback Here is the final result. NaN and non-NaN fill values are almost the same speed.

[asv_bench] asv continuous -f 1.1 pandas-dev/master HEAD -b sparse.sparse_array_constructor
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
· Running 16 total benchmarks (2 commits * 1 environments * 8 benchmarks)
[  0.00%] · For pandas commit hash 35890726:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  6.25%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_10percent                                                            40.7±1ms
[ 12.50%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_1percent                                                           7.34±0.1ms
[ 18.75%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_10percent                                                            37.3±0.6ms
[ 25.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_1percent                                                            6.34±0.09ms
[ 31.25%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_nan_fill_value_10percent                                              78.2±1ms
[ 37.50%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_nan_fill_value_1percent                                             44.0±0.6ms
[ 43.75%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_non_nan_fill_value_10percent                                          87.0±2ms
[ 50.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_non_nan_fill_value_1percent                                         51.6±0.5ms
[ 50.00%] · For pandas commit hash 5279a172:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 56.25%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_10percent                                                            37.2±2ms
[ 62.50%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_1percent                                                           7.07±0.3ms
[ 68.75%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_10percent                                                              46.8±3ms
[ 75.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_1percent                                                            6.66±0.05ms
[ 81.25%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_nan_fill_value_10percent                                              82.8±1ms
[ 87.50%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_nan_fill_value_1percent                                               46.9±1ms
[ 93.75%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_non_nan_fill_value_10percent                                          55.0±2ms
[100.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_non_nan_fill_value_1percent                                         34.4±0.9ms
       before           after         ratio
     [5279a172]       [35890726]
+        55.0±2ms         87.0±2ms     1.58  sparse.sparse_array_constructor.time_sparse_array_constructor_object_non_nan_fill_value_10percent
+      34.4±0.9ms       51.6±0.5ms     1.50  sparse.sparse_array_constructor.time_sparse_array_constructor_object_non_nan_fill_value_1percent
-        46.8±3ms       37.3±0.6ms     0.80  sparse.sparse_array_constructor.time_sparse_array_constructor_int64_10percent

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jreback
Contributor

jreback commented Sep 26, 2017

ok this looks good
going to give a final look in a bit

@jreback
Contributor

jreback commented Sep 27, 2017

lgtm ping on green.

@jreback jreback merged commit 45bd471 into pandas-dev:master Sep 28, 2017
@jreback
Contributor

jreback commented Sep 28, 2017

thanks @Licht-T keep em coming!

alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017