BUG: Don't overflow in DataFrame init #18624

gfyoung · 2017-12-04T10:16:07Z

For integers larger than what uint64 can handle (or smaller than what int64 can handle), we gracefully default to the object dtype instead of overflowing.

Closes #18584.
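To illustrate the intended behavior, here is a minimal pure-Python sketch of the dtype decision described above. It is not the pandas implementation: `infer_int_dtype` and the constant names are hypothetical, and the function only models which of `int64`, `uint64`, or `object` a list of Python ints would end up with under the rule stated in this PR.

```python
# Hypothetical illustration only; pandas does this in Cython, not like this.
INT64_MIN = -2**63
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def infer_int_dtype(values):
    """Return 'int64', 'uint64', or 'object' for a list of Python ints."""
    saw_negative = False      # some value needs a signed type
    saw_above_int64 = False   # some value needs an unsigned 64-bit type

    for val in values:
        # Outside both 64-bit ranges: no fixed-width integer dtype can hold it.
        if val > UINT64_MAX or val < INT64_MIN:
            return "object"
        if val < 0:
            saw_negative = True
        elif val > INT64_MAX:
            saw_above_int64 = True
        # Negative values mixed with values above int64 max cannot share a dtype.
        if saw_negative and saw_above_int64:
            return "object"

    return "uint64" if saw_above_int64 else "int64"


print(infer_int_dtype([1, 2, 3]))       # int64
print(infer_int_dtype([2**63]))         # uint64
print(infer_int_dtype([2**64]))         # object
print(infer_int_dtype([-2**63 - 1]))    # object
print(infer_int_dtype([-1, 2**63]))     # object
```

The early return keeps the common case (values that fit in int64) down to a few integer comparisons per element.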

TomAugspurger · 2017-12-04T13:16:12Z

I assume it'll be fine, but could you do a quick performance check on ints smaller than int64 max to make sure things look OK?

jreback · 2017-12-05T00:20:07Z

doc/source/whatsnew/v0.22.0.txt

@@ -262,4 +262,4 @@ Other
 - Fixed a bug where creating a Series from an array that contains both tz-naive and tz-aware values will result in a Series whose dtype is tz-aware instead of object (:issue:`16406`)
 - Fixed construction of a :class:`Series` from a ``dict`` containing ``NaN`` as key (:issue:`18480`)
 - Adding a ``Period`` object to a ``datetime`` or ``Timestamp`` object will now correctly raise a ``TypeError`` (:issue:`17983`)
-


Move this to the Conversion section (we should probably move some of the other entries appropriately as well, but that can be another PR).

jreback · 2017-12-05T00:20:26Z

pandas/_libs/src/inference.pyx

@@ -1263,7 +1263,7 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
            if not seen.null_:
                seen.saw_int(int(val))

-                if seen.uint_ and seen.sint_:
+                if (seen.uint_ and seen.sint_) or val > oUINT64_MAX:


can you update the doc-string as well

gfyoung · 2017-12-05T04:50:17Z

> I assume it'll be fine, but could you do a quick performance check on ints smaller than int64 max to make sure things look OK?

@TomAugspurger : Didn't see any perf degradation, which makes sense since the changes are just new int comparisons.
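For readers who want to try a similar check, the following is a rough pure-Python sketch of how the cost of one extra integer comparison could be timed. It does not reproduce the benchmark actually run for this PR, and it does not exercise the Cython code in `inference.pyx`; `check_before`, `check_after`, and `time_check` are hypothetical helpers that only mirror the shape of the changed condition.

```python
import timeit

UINT64_MAX = 2**64 - 1


def check_before(val, seen_uint, seen_sint):
    # Shape of the condition before the change.
    return seen_uint and seen_sint


def check_after(val, seen_uint, seen_sint):
    # Shape of the condition after the change: one extra comparison.
    return (seen_uint and seen_sint) or val > UINT64_MAX


def time_check(func, values, repeats=5):
    """Return the best wall-clock time for running func over all values."""
    def run():
        for v in values:
            func(v, False, False)
    return min(timeit.repeat(run, number=10, repeat=repeats))


values = list(range(10_000))  # small ints, well below int64 max
before = time_check(check_before, values)
after = time_check(check_after, values)
print(f"before: {before:.4f}s  after: {after:.4f}s")
```

Timings vary by machine, so no expected numbers are given here; a proper check would use the project's own benchmark suite.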

codecov · 2017-12-05T07:10:44Z

Codecov Report

Merging #18624 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #18624      +/-   ##
==========================================
- Coverage    91.6%   91.58%   -0.02%     
==========================================
  Files         153      153              
  Lines       51253    51253              
==========================================
- Hits        46950    46941       -9     
- Misses       4303     4312       +9

Flag	Coverage Δ
#multiple	`89.45% <ø> (ø)`	⬆️
#single	`40.67% <ø> (-0.11%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.81% <0%> (-0.1%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 52fefd5...5f5d332.

codecov · 2017-12-05T07:10:52Z

Codecov Report

Merging #18624 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #18624      +/-   ##
==========================================
+ Coverage   91.58%   91.58%   +<.01%     
==========================================
  Files         153      153              
  Lines       51250    51250              
==========================================
+ Hits        46935    46938       +3     
+ Misses       4315     4312       -3

Flag	Coverage Δ
#multiple	`89.44% <ø> (+0.02%)`	⬆️
#single	`40.67% <ø> (-0.11%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.81% <0%> (-0.1%)`	⬇️
pandas/plotting/_converter.py	`65.25% <0%> (+1.81%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c3c04e2...9d5abd3.

gfyoung · 2017-12-05T07:22:19Z

All green. @TomAugspurger @jreback PTAL

jorisvandenbossche

Can you also check the case of large negative integers?

jorisvandenbossche · 2017-12-05T08:29:59Z

pandas/tests/frame/test_constructors.py

+    def test_constructor_overflow_uint64(self):
+        # see gh-18584
+        values = np.array([2**64], dtype=object)
+        result = DataFrame(values)


This test already passes on master; it failed only when you didn't already have an object array:

In [6]: pd.DataFrame(np.array([2**64], dtype=object))
Out[6]:
                      0
0  18446744073709551616

In [7]: pd.DataFrame([2**64])
...
OverflowError: Python int too large to convert to C unsigned long

(but keep also this one)

gfyoung · 2017-12-05

> Can you also check the case of large negative integers?

We do now. 😄

Added new tests as well.
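The `OverflowError` quoted in this thread comes from trying to fit a Python int into a fixed-width C integer. As a stdlib-only sketch of those limits (it does not call pandas or NumPy, and `fits_uint64` / `fits_int64` are hypothetical helper names), `int.to_bytes` can be used to check whether a value fits in 8 bytes:

```python
def fits_uint64(val):
    """True if val can be stored in an unsigned 64-bit integer."""
    try:
        val.to_bytes(8, "little", signed=False)
        return True
    except OverflowError:
        # Raised for negative values and for values above 2**64 - 1.
        return False


def fits_int64(val):
    """True if val can be stored in a signed 64-bit integer."""
    try:
        val.to_bytes(8, "little", signed=True)
        return True
    except OverflowError:
        # Raised for values outside [-2**63, 2**63 - 1].
        return False


print(fits_uint64(2**64 - 1))   # True
print(fits_uint64(2**64))       # False
print(fits_int64(-2**63))       # True
print(fits_int64(-2**63 - 1))   # False
```

Values for which both helpers return False are the ones this PR routes to the object dtype.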

jorisvandenbossche · 2017-12-05T10:48:19Z

pandas/tests/frame/test_constructors.py

@@ -196,8 +196,10 @@ def test_constructor_overflow_int64(self):
        assert df_crawls['uid'].dtype == np.uint64

    @pytest.mark.parametrize("values", [np.array([2**64], dtype=object),
-                                        np.array([2**64]), [2**64]])
-    def test_constructor_overflow_uint64(self, values):
+                                        np.array([2**65]), [2**64 + 1],


np.array([2**65]) should be the same as the np.array([2**64], dtype=object) case?

gfyoung · 2017-12-05

Should be, but I'm just testing out different values and initialization of large data.

jorisvandenbossche · 2017-12-05

I mean

In [70]: np.array([2**64], dtype=object)
Out[70]: array([18446744073709551616], dtype=object)

In [71]: np.array([2**64])
Out[71]: array([18446744073709551616], dtype=object)

isn't that testing the same thing twice?

gfyoung · 2017-12-05

No? They're two different values. Just trying out two different large numbers.

jorisvandenbossche · 2017-12-05

Ah sorry, didn't see the difference in the power term.

jreback · 2017-12-05T11:30:25Z

structure lgtm. I think @jorisvandenbossche has a comment.

gfyoung · 2017-12-05T19:59:12Z

@jreback @jorisvandenbossche @TomAugspurger :

All comments have been addressed, and all is green. PTAL.

jorisvandenbossche · 2017-12-05T22:54:50Z

Thanks!

gfyoung added the Compat and Dtype Conversions labels Dec 4, 2017

gfyoung added this to the 0.22.0 milestone Dec 4, 2017

gfyoung force-pushed the dataframe-construct-uint64 branch from 3a53c7d to 9d39610 Compare December 4, 2017 17:03

jreback requested changes Dec 5, 2017

gfyoung force-pushed the dataframe-construct-uint64 branch from 9d39610 to f9c8991 Compare December 5, 2017 04:44

gfyoung force-pushed the dataframe-construct-uint64 branch from f9c8991 to 5f5d332 Compare December 5, 2017 04:51

jorisvandenbossche requested changes Dec 5, 2017

gfyoung force-pushed the dataframe-construct-uint64 branch from 5f5d332 to 30cee48 Compare December 5, 2017 09:55

gfyoung changed the title ~~BUG: Don't overflow in DataFrame init with uint~~ BUG: Don't overflow in DataFrame init Dec 5, 2017

gfyoung force-pushed the dataframe-construct-uint64 branch from 30cee48 to 825bdcf Compare December 5, 2017 09:58

jorisvandenbossche reviewed Dec 5, 2017

jreback approved these changes Dec 5, 2017

gfyoung added 2 commits December 5, 2017 08:55

BUG: Don't overflow in DataFrame init with uint

1c11b72

For integers larger than what uint64 can handle, we gracefully default to the object dtype instead of overflowing. Closes gh-18584.

Don't overflow in DataFrame init with int

9d5abd3

For integers smaller than what int64 can handle, we gracefully default to the object dtype instead of overflowing.

gfyoung force-pushed the dataframe-construct-uint64 branch from 825bdcf to 9d5abd3 Compare December 5, 2017 16:56

jorisvandenbossche approved these changes Dec 5, 2017

jorisvandenbossche merged commit 6b6cfb8 into pandas-dev:master Dec 5, 2017

gfyoung deleted the dataframe-construct-uint64 branch December 6, 2017 02:35

qwhelan mentioned this pull request Jan 5, 2019

PERF: 10x speedup in Series/DataFrame construction for lists of ints #24647

Merged
