BUG: Don't overflow in DataFrame init #18624

Merged

gfyoung
Member

@gfyoung gfyoung commented Dec 4, 2017

For integers larger than what uint64 can handle (or smaller than what int64 can handle), we gracefully default to the object dtype instead of overflowing.

Closes #18584.
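
The fallback rule described above can be sketched in plain Python. This is only an illustration of the bounds check, not the Cython code in `maybe_convert_objects`; the helper name `infer_integer_dtype` and the string return values are assumptions made for the example.

```python
# Hypothetical helper, not part of pandas: mirrors the bounds check in plain Python.
INT64_MIN = -2**63
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def infer_integer_dtype(values):
    """Return "int64", "uint64", or "object" for an iterable of Python ints."""
    saw_uint = False  # some value is above INT64_MAX, so it needs uint64
    saw_sint = False  # some value is negative, so it needs a signed type

    for val in values:
        # Outside both 64-bit ranges: no fixed-width integer dtype can hold it.
        if val > UINT64_MAX or val < INT64_MIN:
            return "object"
        if val > INT64_MAX:
            saw_uint = True
        elif val < 0:
            saw_sint = True
        # A mix of negative and very large values fits neither int64 nor uint64.
        if saw_uint and saw_sint:
            return "object"

    return "uint64" if saw_uint else "int64"


print(infer_integer_dtype([1, 2, 3]))     # int64
print(infer_integer_dtype([2**63]))       # uint64
print(infer_integer_dtype([2**64]))       # object
print(infer_integer_dtype([-2**63 - 1]))  # object
print(infer_integer_dtype([-1, 2**63]))   # object
```

The last case shows why the original `seen.uint_ and seen.sint_` check is kept alongside the new range comparisons: each value fits a 64-bit type on its own, but no single 64-bit integer dtype holds both.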

@gfyoung gfyoung added Compat pandas objects compatability with Numpy or Python functions Dtype Conversions Unexpected or buggy dtype conversions labels Dec 4, 2017
@gfyoung gfyoung added this to the 0.22.0 milestone Dec 4, 2017
@TomAugspurger
Contributor

I assume it'll be fine, but could you do a quick performance check on ints smaller than int64 max to make sure things look OK?

@@ -262,4 +262,4 @@ Other
- Fixed a bug where creating a Series from an array that contains both tz-naive and tz-aware values will result in a Series whose dtype is tz-aware instead of object (:issue:`16406`)
- Fixed construction of a :class:`Series` from a ``dict`` containing ``NaN`` as key (:issue:`18480`)
- Adding a ``Period`` object to a ``datetime`` or ``Timestamp`` object will now correctly raise a ``TypeError`` (:issue:`17983`)
-
Contributor

move to conversion (prob should move some of the other ones appropriately as well) but other PR for that

Member Author

Yep, done.

@@ -1263,7 +1263,7 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
             if not seen.null_:
                 seen.saw_int(int(val))
 
-                if seen.uint_ and seen.sint_:
+                if (seen.uint_ and seen.sint_) or val > oUINT64_MAX:
Contributor

can you update the doc-string as well

Member Author

Yep, done.

@gfyoung
Member Author

gfyoung commented Dec 5, 2017

> I assume it'll be fine, but could you do a quick performance check on ints smaller than int64 max to make sure things look OK?

@TomAugspurger : Didn't see any perf degradation, which makes sense since the changes are just new int comparisons.

@codecov

codecov bot commented Dec 5, 2017

Codecov Report

Merging #18624 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #18624      +/-   ##
==========================================
- Coverage    91.6%   91.58%   -0.02%     
==========================================
  Files         153      153              
  Lines       51253    51253              
==========================================
- Hits        46950    46941       -9     
- Misses       4303     4312       +9
| Flag | Coverage Δ |
|---|---|
| #multiple | 89.45% <ø> (ø) ⬆️ |
| #single | 40.67% <ø> (-0.11%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/io/gbq.py | 25% <0%> (-58.34%) ⬇️ |
| pandas/core/frame.py | 97.81% <0%> (-0.1%) ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 52fefd5...5f5d332. Read the comment docs.

@codecov

codecov bot commented Dec 5, 2017

Codecov Report

Merging #18624 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #18624      +/-   ##
==========================================
+ Coverage   91.58%   91.58%   +<.01%     
==========================================
  Files         153      153              
  Lines       51250    51250              
==========================================
+ Hits        46935    46938       +3     
+ Misses       4315     4312       -3
| Flag | Coverage Δ |
|---|---|
| #multiple | 89.44% <ø> (+0.02%) ⬆️ |
| #single | 40.67% <ø> (-0.11%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/io/gbq.py | 25% <0%> (-58.34%) ⬇️ |
| pandas/core/frame.py | 97.81% <0%> (-0.1%) ⬇️ |
| pandas/plotting/_converter.py | 65.25% <0%> (+1.81%) ⬆️ |

Powered by Codecov. Last update c3c04e2...9d5abd3. Read the comment docs.

@gfyoung
Member Author

gfyoung commented Dec 5, 2017

All green. @TomAugspurger @jreback PTAL

Member

@jorisvandenbossche jorisvandenbossche left a comment

Can you also check the case of large negative integers ?

    def test_constructor_overflow_uint64(self):
        # see gh-18584
        values = np.array([2**64], dtype=object)
        result = DataFrame(values)
Member

This test already passes on master, it was when you didn't already have an object array that it failed:

In [6]: pd.DataFrame(np.array([2**64], dtype=object))
Out[6]: 
                      0
0  18446744073709551616

In [7]: pd.DataFrame([2**64])
...
OverflowError: Python int too large to convert to C unsigned long

(but keep also this one)
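
As a rough analogy for the `OverflowError` above, the standard-library `struct` module shows the same kind of failure when a Python int is pushed into a fixed 64-bit unsigned slot. This sketch does not reproduce the pandas code path; the helper name `fits_in_uint64` is made up for the illustration.

```python
import struct

UINT64_MAX = 2**64 - 1


def fits_in_uint64(value):
    """Return True if `value` can be packed into an unsigned 64-bit slot."""
    try:
        # "<Q" is a little-endian unsigned 64-bit integer.
        struct.pack("<Q", value)
    except (struct.error, OverflowError):
        # Python ints are unbounded, but the 8-byte slot is not.
        return False
    return True


print(fits_in_uint64(UINT64_MAX))  # True
print(fits_in_uint64(2**64))       # False
print(fits_in_uint64(-1))          # False
```

An object array sidesteps this because it stores references to Python ints rather than fixed-width values, which is why the `dtype=object` case already passed on master.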

Member Author

> Can you also check the case of large negative integers ?

We do now. 😄

Added new tests as well.

@gfyoung gfyoung changed the title BUG: Don't overflow in DataFrame init with uint BUG: Don't overflow in DataFrame init Dec 5, 2017
@@ -196,8 +196,10 @@ def test_constructor_overflow_int64(self):
         assert df_crawls['uid'].dtype == np.uint64
 
     @pytest.mark.parametrize("values", [np.array([2**64], dtype=object),
-                                        np.array([2**64]), [2**64]])
-    def test_constructor_overflow_uint64(self, values):
+                                        np.array([2**65]), [2**64 + 1],
Member

np.array([2**65]) should be the same as the np.array([2**64], dtype=object) case ?

Member Author

@gfyoung gfyoung Dec 5, 2017

Should be but just testing out different values and initialization of large data.

Member

I mean

In [70]: np.array([2**64], dtype=object)
Out[70]: array([18446744073709551616], dtype=object)

In [71]: np.array([2**64])
Out[71]: array([18446744073709551616], dtype=object)

is twice the same you test?

Member Author

No? They're two different values. Just trying out two different large numbers.

Member

ah sorry, didn't see the difference in the power term

@jreback
Contributor

jreback commented Dec 5, 2017

structure lgtm. I think @jorisvandenbossche has a comment.

For integers larger than what uint64 can handle,
we gracefully default to the object dtype instead
of overflowing.

Closes gh-18584.
For integers smaller than what int64 can
handle, we gracefully default to the object
dtype instead of overflowing.
@gfyoung
Member Author

gfyoung commented Dec 5, 2017

@jreback @jorisvandenbossche @TomAugspurger :

All comments have been addressed, and all is green. PTAL.

@jorisvandenbossche jorisvandenbossche merged commit 6b6cfb8 into pandas-dev:master Dec 5, 2017
@jorisvandenbossche
Member

Thanks!

@gfyoung gfyoung deleted the dataframe-construct-uint64 branch December 6, 2017 02:35