
BUG: 2D ndarray of dtype 'object' is always copied upon construction #39272

Merged
merged 20 commits into from
Jul 15, 2021

Conversation

irgolic
Contributor

@irgolic irgolic commented Jan 19, 2021

This change is aesthetically unpleasing; it indents too far. #26825 already suggested separating this part into a function. Please let me know what exactly I should abstract into a function, and where to put it. Alternatively, feel free to amend this PR yourself.

Thanks for pandas. :)

@irgolic irgolic force-pushed the 2d-object-dont-copy branch 6 times, most recently from 2ed6a20 to 40b0cea Compare January 19, 2021 22:22
Contributor

@jreback jreback left a comment

you will have to measure whether performance changes in any way. run the asv suite and see which benchmarks are affected.

I am not sure how this change actually helps anything. e.g. how does sharing memory actually help here? (because the second you do almost anything these blocks will likely get split apart and copied again).

so having a specific benchmark is really important here. sharing memory is not that interesting as it's fleeting.
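A dedicated asv-style benchmark for this construction case might look like the sketch below (hypothetical; the class and method names are illustrative, not from the PR — asv times any method whose name starts with `time_`):

```python
import numpy as np
import pandas as pd

class FromObjectArray:
    """Hypothetical asv benchmark: DataFrame construction from a 2D object array."""

    def setup(self):
        # a moderately sized 2D object ndarray, the case this PR avoids copying
        self.arr = np.full((100, 100), "x", dtype=object)

    def time_constructor(self):
        pd.DataFrame(self.arr)
```

Running `asv continuous` before and after the change would then show whether skipping the copy is measurable.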

@@ -231,6 +231,7 @@ Datetimelike
- Bug in :meth:`DatetimeIndex.intersection`, :meth:`DatetimeIndex.symmetric_difference`, :meth:`PeriodIndex.intersection`, :meth:`PeriodIndex.symmetric_difference` always returning object-dtype when operating with :class:`CategoricalIndex` (:issue:`38741`)
- Bug in :meth:`Series.where` incorrectly casting ``datetime64`` values to ``int64`` (:issue:`37682`)
- Bug in :class:`Categorical` incorrectly typecasting ``datetime`` object to ``Timestamp`` (:issue:`38878`)
- Bug in :func:`pandas.core.internals.construction.init_ndarray` unnecessarily copying all object arrays after datetime inference
Contributor

we don't add a whatsnew note for internal functions; rather you want to note what a user would care about. so you can say roughly the same thing, but it's about the DataFrame constructor.

Contributor

add the issue number

make_block(dvals_list[n], placement=[n], ndim=2)
for n in range(len(dvals_list))
]
i = 0
Contributor

yeah this is not great. we already do this type of dtype grouping inside the block manager constructors; much better to do it there. However this might be non-trivial and/or might have a big performance impact.

so i would try to re-use one of the existing grouping routines here instead.

Contributor Author

Could you point me to the module?

Member

I don't think we have anything quite like this, in part because I wrote something similar in #33597

Member

what if you don't do the make_block calls here, but instead just construct the arrays, and then instead of calling create_block_manager_from_blocks at the end you could call create_block_manager_from_arrays?

Contributor Author

That solves it but has a ripple effect of making other stuff fail. That constructor might need to be adjusted for use in instantiating a DataFrame.
[screenshot: failing test results, 2021-01-20]

Member

can you post one of the actual tracebacks?

Contributor Author

@irgolic irgolic Jan 20, 2021

Some of these were due to a missing ValueError in _form_blocks, but it still ended up copying everything – that's just how create_block_manager_from_arrays is implemented. I've implemented a new _from_array constructor, which falls back to _form_blocks in the case of a datetime conversion, but otherwise doesn't copy.

Contributor Author

I tried to avoid adding a kwarg to the constructor, but it seemed like that would get messier.

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Internals Related to non-user accessible pandas implementation labels Jan 19, 2021
@jreback
Contributor

jreback commented Jan 19, 2021

cc @jbrockmendel

@jbrockmendel
Member

side note, after #38939 we can also avoid consolidating in this case, which would avoid making copies

)
else:
block_values.extend(
make_block(
Member

what happens here if we have two consecutive datetime64tz columns?

Contributor Author

In[5]: block_values
Out[5]: 
[DatetimeTZBlock: slice(0, 1, 1), 1 x 1, dtype: datetime64[ns, UTC],
 DatetimeTZBlock: slice(1, 2, 1), 1 x 1, dtype: datetime64[ns, UTC],
 DatetimeTZBlock: slice(2, 3, 1), 1 x 1, dtype: datetime64[ns, UTC]]
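The output above shows one block per column: tz-aware datetime columns use an extension dtype and are not consolidated into a single block. A minimal user-level sketch of the same situation:

```python
import pandas as pd

# two consecutive tz-aware columns; datetime64[ns, tz] is an extension
# dtype, so each column stays in its own (non-consolidated) block
df = pd.DataFrame(
    {
        "a": pd.to_datetime(["2021-01-01"]).tz_localize("UTC"),
        "b": pd.to_datetime(["2021-01-02"]).tz_localize("UTC"),
    }
)
print(df.dtypes)
```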

@irgolic
Contributor Author

irgolic commented Jan 20, 2021

From what I can gather, the problem is that not all (but some) datetimelike types are dtypes. Checking equality between a dtype and a non-dtype throws an exception, which is likely why this wasn't implemented in the block manager constructor (where I assume everything works in dtypes?)

Edit: ignore this comment

@jbrockmendel
Member

Checking equality between a dtype and a non-dtype throws an exception

I'm guessing the issue is trying to do some_numpy_dtype == some_extension_dtype, which does raise. If you use `is_dtype_equal(some_numpy_dtype, some_extension_dtype)` you'll get a bool back
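For illustration, a minimal sketch of the suggested comparison using `pandas.api.types.is_dtype_equal`, which always returns a plain bool for mixed numpy/extension dtype pairs:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_dtype_equal

np_dtype = np.dtype("int64")
ext_dtype = pd.CategoricalDtype()

# works for numpy vs numpy as well as numpy vs extension dtypes,
# and never raises on a mixed pair
print(is_dtype_equal(np_dtype, np.dtype("int64")))  # True
print(is_dtype_equal(np_dtype, ext_dtype))          # False
```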

@irgolic irgolic force-pushed the 2d-object-dont-copy branch 3 times, most recently from 560f894 to 322a677 Compare January 20, 2021 17:20
@pep8speaks

pep8speaks commented Jan 20, 2021

Hello @irgolic! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-07-14 18:45:43 UTC

@irgolic irgolic force-pushed the 2d-object-dont-copy branch 3 times, most recently from 57d8c6b to 1d95a25 Compare January 20, 2021 18:20
@irgolic
Copy link
Contributor Author

irgolic commented Jan 20, 2021

I am not sure how this change actually helps anything. e.g. how does sharing memory actually help here? (because the second you do almost anything these blocks will likely get split apart and copied again).

I'm a core developer of https://github.com/biolab/orange3, focusing on UX. Orange's data objects are also based on numpy arrays; Orange's Table exposes them as table.X, table.Y, table.metas. I'm trying to wrap our numpy arrays in DataFrame objects and allow people to use pandas via accessors on Table (table.X_df, table.Y_df, table.metas_df) (biolab/orange3#5189).

While you're right that most operations will split it apart anyway, in my use case it'll avoid an unnecessary copy operation.
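Concretely, the use case looks something like the sketch below (the `metas` array is a hypothetical stand-in for Orange's data; whether memory is actually shared depends on the pandas version and this PR's fix):

```python
import numpy as np
import pandas as pd

# stand-in for Orange's table.metas: a 2D object ndarray
metas = np.array([["a", 1.0], ["b", 2.0]], dtype=object)

df = pd.DataFrame(metas)
# with the fix, the constructor can wrap the array instead of copying it
print(np.shares_memory(df.values, metas))
```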

you will have to measure performance if it changes in any way. run the asv suite. and see which benchmarks are affected.

On it. If none of the benchmarks change, I'll add one.

@irgolic irgolic force-pushed the 2d-object-dont-copy branch 4 times, most recently from cd2281e to 4ef2e7e Compare January 21, 2021 04:26
@irgolic
Copy link
Contributor Author

irgolic commented Jan 21, 2021

Let me know if you prefer to change create_block_manager_from_arrays, as is done in the latest commit, and I'll fix it up.

As the function is used elsewhere too, it makes a test fail:

def test_constructor_series_copy(self, float_frame):
    series = float_frame._series
  
    df = DataFrame({"A": series["A"]})
  
    df["A"][:] = 5
  
    assert not (series["A"] == 5).all()  # False

So, create_block_manager_from_arrays won't create a copy of the series, because it's unnecessary to do so. There's only one test in pandas relying on this behavior; I'm not sure how much code around the world relies on it.
If the DataFrame contains anything more than a single series/array, it'll copy. This is how I adjusted the test too:

def test_constructor_series_copy(self, float_frame):
    series = float_frame._series
  
    df = DataFrame({"A": series["A"], "B": series["A"]})
  
    df["A"][:] = 5
  
    assert not (series["A"] == 5).all()  # True
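For callers who depend on the old defensive-copy behavior, passing an explicit copy sidesteps the question entirely; a sketch (not part of the PR):

```python
import pandas as pd

series = pd.Series([1.0, 2.0, 3.0], name="A")

# an explicit .copy() guarantees the source Series is never mutated,
# regardless of the constructor's internal copy behavior
df = pd.DataFrame({"A": series.copy()})
df["A"] = 5

assert not (series == 5).any()  # source is untouched
```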

names_idx = names
if names_idx.equals(axes[0]):
names_indexer = np.arange(len(names_idx))
if len(arrays) == 1:
Contributor

this for sure needs comments, and even better please separate this out into a new function.

Contributor Author

Better?

Contributor

no, this looks almost like a groupby to me. i am puzzled why the axes are involved here at all.

Contributor Author

@irgolic irgolic Jan 22, 2021

That's how it was already written. It finds all variables of the same block type, and makes blocks out of them. If you pass in more arrays than names, only those arrays whose names were specified are blockified.

I could rewrite this into a groupby, I guess. It'll create more contiguous blocks than it does currently.

If you look at create_block_manager_from_blocks, you'll see that it allows passing a single array (not a block), and it'll create a contiguous block for it. I could do this in create_block_manager_from_arrays, but IIRC that made a lot of things fail, like init_dict, which uses the same constructor. Maybe #38939 will fix that.

I still think making a new constructor would be more elegant. It would allow removal of this weird bit in create_block_manager_from_blocks too:

def create_block_manager_from_blocks(blocks, axes: List[Index]) -> BlockManager:
    try:
        if len(blocks) == 1 and not isinstance(blocks[0], Block):
            # if blocks[0] is of length 0, return empty blocks
            if not len(blocks[0]):
                blocks = []
            else:
                # It's OK if a single block is passed as values, its placement
                # is basically "all items", but if there're many, don't bother
                # converting, it's an error anyway.
                blocks = [
                    make_block(
                        values=blocks[0], placement=slice(0, len(axes[0])), ndim=2
                    )
                ]

        ...

It's unintuitive to pass something that's not a block to a method suffixed with _from_blocks.


if len(arrays) == 1:
Contributor

add comments

doc/source/whatsnew/v1.3.1.rst (resolved)
@@ -1845,6 +1845,14 @@ def test_constructor_series_copy(self, float_frame):

assert not (series["A"] == 5).all()

def test_object_array_does_not_copy(self):
a = np.array(["a", "b"], dtype="object")
Contributor

add the issue number as a comment

assert np.shares_memory(df.values, a)
df2 = DataFrame(b)
assert np.shares_memory(df2.values, b)

Contributor

assert the frames are equal as well

Contributor Author

did you mean assert that the numpy arrays are equal? They're not equal, but they do share memory:

>       np.testing.assert_array_equal(df.values, a)
E       AssertionError: 
E       Arrays are not equal
E       
E       (shapes (2, 1), (2,) mismatch)
E        x: array([['a'],
E                  ['b']], dtype=object)
E        y: array(['a', 'b'], dtype=object)

(note also I've split the test into two)
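For reference, the shape mismatch above can be worked around by comparing against a reshaped view; a sketch (not the test that landed in the PR):

```python
import numpy as np
import pandas as pd

a = np.array(["a", "b"], dtype=object)
df = pd.DataFrame(a)

# df.values is 2D with shape (2, 1) while a is 1D with shape (2,),
# so compare against a reshaped (zero-copy) view of a
np.testing.assert_array_equal(df.values, a.reshape(-1, 1))
```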

pandas/core/internals/construction.py (resolved)
@irgolic irgolic force-pushed the 2d-object-dont-copy branch 5 times, most recently from c0e144f to d271764 Compare July 11, 2021 17:47
@irgolic
Copy link
Contributor Author

irgolic commented Jul 11, 2021

I finally figured out what was failing. It was the new ArrayManager. I just needed to add @td.skip_array_manager_invalid_test on the test.

The way the new array data management is set up currently is such that it copies all incoming numpy arrays. Our use case for pandas relies on 'wrapping' numpy arrays without copying them. Please consider this when enforcing the new array data manager. Although I'm assuming that this won't be standard for a while.

Until then, please accept this contribution, which fixes the 2D object array copy bug in the block manager. Note, I've also refactored that bit and made a create_block_manager_from_array method instead of _from_blocks. create_block_manager_from_blocks is now not used anywhere in the pandas codebase, but dask calls this pandas internal, so it should probably be kept for now.
@jreback Sorry for bouncing back and forth between the two implementations. I think this one's cleaner, let me know if I should change it back or amend it.

(also not sure why it says pandas-dev.pandas failed, all the tests passed)

@irgolic irgolic requested a review from jreback July 11, 2021 20:20
@jbrockmendel
Member

The way the new array data management is set up currently is such that it copies all incoming numpy arrays. Our use case for pandas relies on 'wrapping' numpy arrays without copying them. Please consider this when enforcing the new array data manager. Although I'm assuming that this won't be standard for a while.

IIRC there was some discussion about this copy being avoided, not sure what happened to that. Opening an issue about it might be worthwhile.

# don't convert (and copy) the objects if no type inference occurs
if any(
not is_dtype_equal(instance.dtype, array.dtype)
for instance in maybe_datetime
Member

double-check me on this, but i think a more performant alternative would be

obj_columns = [x for x in array]
maybe_datetime = [maybe_infer_to_datetimelike(x) for x in obj_columns]
if any(x is not y for x, y in zip(obj_columns, maybe_datetime)):
    [...]
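The idea behind the suggestion: a converter that returns its input unchanged when no conversion applies lets you detect conversion with a cheap identity check instead of dtype comparisons. A self-contained sketch with a hypothetical stand-in converter (pandas uses maybe_infer_to_datetimelike here):

```python
def maybe_convert(col):
    # hypothetical stand-in: returns the input object unchanged
    # when no conversion applies (as maybe_infer_to_datetimelike does)
    return col

cols = [[1, 2], ["a", "b"]]
converted = [maybe_convert(c) for c in cols]

# per-column identity comparison is O(1) and involves no dtype machinery;
# it is True only for columns the converter actually replaced
any_converted = any(x is not y for x, y in zip(cols, converted))
print(any_converted)  # False: nothing was converted
```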

Contributor Author

I'm running asv to see if there's a performance difference

Contributor Author

@irgolic irgolic Jul 14, 2021

Here's the full asv run comparing the last commit (your suggestion) and the second to last commit. Any of these in particular you'd like me to rerun?

       before           after         ratio                                                                                                                  
     [dc2ae20d]       [96dd1b9e]                                                                                                                     
     <2d-object-dont-copy>       <2d-object-dont-copy~1>                                                                                                     
+     4.43±0.01ms       94.2±0.3ms    21.27  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
+        59.0±3ms       75.8±0.4ms     1.28  inference.ToTimedeltaErrors.time_convert('coerce')                                                              
+        45.8±5ms         58.5±5ms     1.28  gil.ParallelGroupbyMethods.time_loop(2, 'last')                                                                 
+      4.37±0.2ms       5.34±0.3ms     1.22  rolling.Engine.time_rolling_apply('Series', 'float', <function sum>, 'cython', 'max')                           
+      5.31±0.1ms       6.45±0.5ms     1.21  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function Engine.<lambda>>, 'cython', 'sum')            
+      4.46±0.1ms       5.40±0.3ms     1.21  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function sum>, 'cython', 'mean')                       
+      5.38±0.2ms       6.51±0.6ms     1.21  rolling.Engine.time_rolling_apply('DataFrame', 'int', <function Engine.<lambda>>, 'cython', 'max')              
+         147±9ms          178±1ms     1.21  inference.ToDatetimeISO8601.time_iso8601_tz_spaceformat                                                         
+        72.1±5ms         86.7±1ms     1.20  inference.ToDatetimeCache.time_dup_string_tzoffset_dates(False)                                                 
+     4.57±0.09ms       5.47±0.3ms     1.20  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function sum>, 'cython', 'max')                        
+      5.23±0.2ms       6.23±0.4ms     1.19  rolling.Engine.time_rolling_apply('Series', 'float', <function Engine.<lambda>>, 'cython', 'mean')              
+      5.40±0.1ms       6.42±0.5ms     1.19  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function Engine.<lambda>>, 'cython', 'median')         
+      4.29±0.1ms       5.10±0.3ms     1.19  rolling.Engine.time_rolling_apply('Series', 'int', <function sum>, 'cython', 'mean')                            
+      4.52±0.2ms       5.37±0.2ms     1.19  rolling.Engine.time_rolling_apply('DataFrame', 'int', <function sum>, 'cython', 'median')                       
+      4.45±0.2ms       5.28±0.4ms     1.19  rolling.Engine.time_rolling_apply('Series', 'float', <function sum>, 'cython', 'mean')                          
+      15.4±0.6μs       18.2±0.3μs     1.19  series_methods.NanOps.time_func('sum', 1000, 'boolean')                                                         
+      4.39±0.1ms       5.20±0.2ms     1.19  rolling.Engine.time_rolling_apply('Series', 'int', <function sum>, 'cython', 'max')                             
+      5.25±0.2ms       6.23±0.4ms     1.18  rolling.Engine.time_rolling_apply('Series', 'int', <function Engine.<lambda>>, 'cython', 'median')              
+      5.20±0.2ms       6.16±0.3ms     1.18  rolling.Engine.time_rolling_apply('Series', 'float', <function Engine.<lambda>>, 'cython', 'sum')               
+      4.37±0.2ms       5.17±0.3ms     1.18  rolling.Engine.time_rolling_apply('Series', 'int', <function sum>, 'cython', 'median')                          
+     4.53±0.04ms       5.35±0.3ms     1.18  rolling.Engine.time_rolling_apply('DataFrame', 'int', <function sum>, 'cython', 'min')                          
+      4.35±0.1ms       5.13±0.3ms     1.18  rolling.Engine.time_rolling_apply('Series', 'int', <function sum>, 'cython', 'sum')                             
+      4.40±0.1ms       5.18±0.3ms     1.18  rolling.Engine.time_rolling_apply('Series', 'float', <function sum>, 'cython', 'median')                        
+      5.25±0.2ms       6.17±0.2ms     1.17  rolling.Engine.time_rolling_apply('Series', 'float', <function Engine.<lambda>>, 'cython', 'min')               
+      5.39±0.2ms       6.32±0.4ms     1.17  rolling.Engine.time_rolling_apply('DataFrame', 'int', <function Engine.<lambda>>, 'cython', 'min')              
+        58.8±2ms         68.9±7ms     1.17  inference.ToTimedeltaErrors.time_convert('ignore')                                                              
+      4.42±0.2ms       5.17±0.2ms     1.17  rolling.Engine.time_rolling_apply('Series', 'float', <function sum>, 'cython', 'min')                           
+      5.33±0.1ms       6.23±0.2ms     1.17  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function Engine.<lambda>>, 'cython', 'min')            
+      5.38±0.3ms       6.27±0.4ms     1.17  rolling.Engine.time_rolling_apply('Series', 'int', <function Engine.<lambda>>, 'cython', 'max')                 
+      4.73±0.3ms       5.51±0.4ms     1.17  rolling.Engine.time_expanding_apply('Series', 'int', <function sum>, 'cython', 'max')                           
+      5.20±0.1ms       6.06±0.3ms     1.16  rolling.Engine.time_rolling_apply('Series', 'int', <function Engine.<lambda>>, 'cython', 'min')                 
+      4.60±0.2ms       5.36±0.3ms     1.16  rolling.Engine.time_rolling_apply('DataFrame', 'int', <function sum>, 'cython', 'max')                          
+      13.2±0.4μs       15.3±0.3μs     1.16  series_methods.NanOps.time_func('max', 1000, 'boolean')                                                         
+     4.40±0.06ms       5.11±0.4ms     1.16  rolling.Engine.time_rolling_apply('Series', 'float', <function sum>, 'cython', 'sum')                           
+      4.57±0.1ms       5.29±0.3ms     1.16  rolling.Engine.time_rolling_apply('DataFrame', 'int', <function sum>, 'cython', 'sum')                          
+     4.67±0.09ms       5.41±0.2ms     1.16  rolling.Engine.time_rolling_apply('DataFrame', 'int', <function sum>, 'cython', 'mean')                         
+      5.39±0.1ms       6.23±0.5ms     1.16  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function Engine.<lambda>>, 'cython', 'mean')           
+      5.21±0.1ms       6.02±0.4ms     1.16  rolling.Engine.time_rolling_apply('Series', 'float', <function Engine.<lambda>>, 'cython', 'max')               
+     4.59±0.03ms       5.29±0.2ms     1.15  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function sum>, 'cython', 'median')                     
+      5.31±0.1ms       6.12±0.3ms     1.15  rolling.Engine.time_rolling_apply('DataFrame', 'int', <function Engine.<lambda>>, 'cython', 'mean')             
+     4.55±0.03ms       5.24±0.3ms     1.15  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function sum>, 'cython', 'sum')                        
+      5.37±0.1ms       6.19±0.4ms     1.15  rolling.Engine.time_rolling_apply('DataFrame', 'int', <function Engine.<lambda>>, 'cython', 'sum')              
+      16.4±0.7μs         18.7±1μs     1.14  series_methods.SearchSorted.time_searchsorted('uint8')                                                          
+      5.26±0.1ms       5.99±0.3ms     1.14  rolling.Engine.time_rolling_apply('Series', 'float', <function Engine.<lambda>>, 'cython', 'median')            
+      4.64±0.1ms       5.29±0.3ms     1.14  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function sum>, 'cython', 'min')                        
+     1.17±0.01ms      1.33±0.07ms     1.14  hash_functions.NumericSeriesIndexingShuffled.time_loc_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 500000)
+     2.58±0.03μs      2.93±0.08μs     1.13  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.Float64Engine'>, <class 'numpy.float64'>), 'monotonic_incr')
+      5.21±0.2ms       5.91±0.3ms     1.13  rolling.Engine.time_rolling_apply('Series', 'int', <function Engine.<lambda>>, 'cython', 'mean')
+        4.28±0ms       4.84±0.5ms     1.13  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('time', 10000, datetime.timezone(datetime.timedelta(seconds=3600)))
+      4.51±0.2ms       5.10±0.2ms     1.13  rolling.Engine.time_rolling_apply('Series', 'int', <function sum>, 'cython', 'min')                             
+      5.50±0.2ms       6.21±0.2ms     1.13  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function Engine.<lambda>>, 'cython', 'max')            
+      16.5±0.4μs         18.6±1μs     1.13  series_methods.SearchSorted.time_searchsorted('int16')                                                          
+     2.57±0.01μs       2.89±0.2μs     1.13  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.Int64Engine'>, <class 'numpy.int64'>), 'monotonic_incr')
+       452±0.8ms         508±50ms     1.12  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('time', 1000000, datetime.timezone(datetime.timedelta(seconds=3600)))
+        806±20μs         904±40μs     1.12  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function ne>)   
+     3.20±0.04μs       3.59±0.2μs     1.12  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(0, 'start', 'month', 'QS', 3)                       
+      13.2±0.5μs       14.8±0.6μs     1.12  series_methods.NanOps.time_func('min', 1000, 'boolean')                                                         
+         121±5μs          136±4μs     1.12  multiindex_object.GetLoc.time_large_get_loc                                                                     
+      47.6±0.2ms         53.0±1ms     1.11  rolling.TableMethod.time_apply('single')                                                                        
+     4.17±0.02μs       4.64±0.2μs     1.11  categoricals.CategoricalSlicing.time_getitem_scalar('non_monotonic')                                            
+      15.9±0.3ms       17.7±0.2ms     1.11  inference.ToTimedelta.time_convert_string_days                                                                  
+        741±20μs         824±20μs     1.11  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function eq>)       
+        732±20μs         813±20μs     1.11  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function le>)   
+     3.32±0.03μs       3.67±0.2μs     1.11  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(0, 'start', 'year', 'B', 5)                         
+     3.44±0.02μs       3.81±0.3μs     1.11  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(1, 'start', 'year', 'B', 5)                         
+       113±0.7μs          125±1μs     1.11  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'unique_monotonic_inc')
+     4.16±0.02μs       4.60±0.4μs     1.10  categoricals.CategoricalSlicing.time_getitem_scalar('monotonic_incr')                                           
+      13.0±0.2μs       14.3±0.2μs     1.10  series_methods.NanOps.time_func('sum', 1000, 'Int64')                                                           
+        60.3±1ms         66.6±1ms     1.10  rolling.Groupby.time_rolling_int('mean')                                                                        
+      14.4±0.1μs       15.9±0.5μs     1.10  series_methods.NanOps.time_func('min', 1000, 'Int64')                                                           
+        60.7±1ms         67.0±1ms     1.10  rolling.Groupby.time_rolling_int('max')                                                                         
+        61.5±1ms         67.9±1ms     1.10  rolling.Groupby.time_rolling_int('median')                                                                      
+        544±20ns         600±30ns     1.10  index_cached_properties.IndexCache.time_inferred_type('Int64Index')                                             
+     15.8±0.04μs       17.4±0.8μs     1.10  inference.MaybeConvertObjects.time_maybe_convert_objects                                                        
+       437±0.7ms         482±40ms     1.10  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('time', 1000000, None)                                
+        60.5±2ms         66.6±1ms     1.10  rolling.Groupby.time_rolling_int('sum' (0))                                                                     
+     3.36±0.01μs       3.70±0.2μs     1.10  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(0, 'end', 'year', 'B', 12)                          
+     3.24±0.04μs       3.57±0.2μs     1.10  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(0, 'start', 'quarter', 'QS', 5)                     
+         142±2μs          157±1μs     1.10  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+      41.1±0.3ms         45.2±1ms     1.10  frame_methods.Isnull.time_isnull_obj                                                                            
+     3.47±0.02μs       3.82±0.2μs     1.10  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(1, 'end', 'quarter', 'QS', 12)                      
-      8.68±0.5μs      7.89±0.02μs     0.91  tslibs.resolution.TimeResolution.time_get_resolution('h', 0, datetime.timezone(datetime.timedelta(seconds=3600)))
-      8.79±0.6μs      7.98±0.04μs     0.91  tslibs.resolution.TimeResolution.time_get_resolution('D', 1, datetime.timezone(datetime.timedelta(seconds=3600)))
-        18.0±1μs      16.3±0.05μs     0.91  tslibs.resolution.TimeResolution.time_get_resolution('ns', 1, tzlocal())
-      9.93±0.6μs      9.01±0.02μs     0.91  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 4000, datetime.timezone(datetime.timedelta(seconds=3600)))
-      1.10±0.09s          998±5ms     0.91  arithmetic.OffsetArrayArithmetic.time_add_series_offset(<CustomBusinessMonthEnd> (0))
-      2.71±0.2μs         2.46±0μs     0.91  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('date', 0, None)
-      9.98±0.7μs      9.05±0.03μs     0.91  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 2000, datetime.timezone(datetime.timedelta(sec
onds=3600)))
-      2.75±0.2μs      2.50±0.02μs     0.91  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('time', 0, tzlocal())
-      2.72±0.2μs         2.46±0μs     0.91  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('timestamp', 0, None)
-      9.98±0.5μs      9.03±0.09μs     0.91  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 5000, datetime.timezone(datetime.timedelta(sec
onds=3600)))

-      9.11±0.3μs       8.24±0.3μs     0.90  tslibs.timestamp.TimestampProperties.time_weekday_name(tzlocal(), None)
-      9.91±0.6μs       8.96±0.1μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 12000, datetime.timezone(datetime.timedelta(se
conds=3600)))
-      10.0±0.6μs      9.06±0.04μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 3000, datetime.timezone(datetime.timedelta(sec
onds=3600)))
-      10.1±0.8μs      9.09±0.03μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 8000, datetime.timezone(datetime.timedelta(sec
onds=3600)))
-      9.94±0.7μs       8.99±0.1μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 10000, datetime.timezone(datetime.timedelta(se
conds=3600)))
-        19.8±1μs      17.9±0.05μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(100, 2000, <DstTzInfo 'US/Pacific' LMT-1 day, 16:
07:00 STD>)
-      1.11±0.09s          998±3ms     0.90  arithmetic.OffsetArrayArithmetic.time_add_dti_offset(<CustomBusinessMonthEnd> (0))
-      9.91±0.6μs      8.95±0.08μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 7000, datetime.timezone(datetime.timedelta(sec
onds=3600)))
-      9.99±0.5μs      9.01±0.09μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 1000, datetime.timezone(datetime.timedelta(sec
onds=3600)))
-      2.58±0.2μs      2.32±0.01μs     0.90  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('time', 0, datetime.timezone.utc)
-      9.97±0.7μs      8.96±0.06μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 11000, datetime.timezone(datetime.timedelta(se
conds=3600)))
-      29.3±0.3ms       26.2±0.5ms     0.90  rolling.Apply.time_rolling('Series', 3, 'int', <built-in function sum>, False)
-      10.1±0.7μs      9.03±0.06μs     0.89  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 9000, datetime.timezone(datetime.timedelta(sec
onds=3600)))
-      2.60±0.2μs      2.32±0.01μs     0.89  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('timestamp', 0, datetime.timezone.utc)
-      10.0±0.7μs      8.98±0.04μs     0.89  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 6000, datetime.timezone(datetime.timedelta(sec
onds=3600)))
-      8.65±0.2μs       7.73±0.1μs     0.89  tslibs.period.TimePeriodArrToDT64Arr.time_periodarray_to_dt64arr(1, 7000)
-      9.99±0.7μs      8.93±0.05μs     0.89  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 2011, datetime.timezone(datetime.timedelta(sec
onds=3600)))
-        26.7±2μs       23.8±0.2μs     0.89  tslibs.timestamp.TimestampOps.time_normalize(tzlocal())
-      2.77±0.2μs      2.47±0.02μs     0.89  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('time', 0, None)
-      6.29±0.6μs      5.60±0.08μs     0.89  tslibs.timedelta.TimedeltaConstructor.time_from_unit
-      11.6±0.8μs       10.3±0.1μs     0.89  tslibs.timestamp.TimestampOps.time_replace_None(datetime.timezone(datetime.timedelta(seconds=3600)))
-      10.1±0.6μs      8.95±0.06μs     0.89  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 10000, datetime.timezone(datetime.timedelta(se
conds=3600)))
-      5.64±0.3μs      5.00±0.04μs     0.89  tslibs.timestamp.TimestampOps.time_normalize(None)
-        58.8±3μs       51.9±0.6μs     0.88  tslibs.timestamp.TimestampOps.time_normalize(tzfile('/usr/share/zoneinfo/Asia/Tokyo'))
-      3.19±0.3μs      2.80±0.02μs     0.88  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('date', 1, None)
-        9.72±1μs      8.54±0.05μs     0.88  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('timestamp', 0, datetime.timezone(datetime.timedelta(s
econds=3600)))
-      2.85±0.2μs      2.50±0.01μs     0.88  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('timestamp', 0, tzlocal())
-     1.27±0.03ms      1.11±0.06ms     0.87  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function mul>)
-        15.5±2μs       13.6±0.2μs     0.87  tslibs.timedelta.TimedeltaConstructor.time_from_components
-        8.74±1μs      7.57±0.05μs     0.87  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 6000, <DstTzInfo 'US/Pacific' LMT-1 day, 16:07
:00 STD>)
-      8.05±0.8μs      6.92±0.03μs     0.86  tslibs.timestamp.TimestampOps.time_replace_None(<DstTzInfo 'US/Pacific' LMT-1 day, 16:07:00 STD>)
-     1.59±0.01ms      1.36±0.01ms     0.85  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(100, 4000, tzlocal())
-      6.61±0.6μs      5.65±0.08μs     0.85  tslibs.timedelta.TimedeltaConstructor.time_from_string
-        24.1±2μs       20.5±0.1μs     0.85  tslibs.timestamp.TimestampOps.time_normalize(datetime.timezone(datetime.timedelta(seconds=3600)))
-     11.7±0.05ms      9.84±0.06ms     0.84  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_mon
otonic_inc')
-      1.66±0.1ms      1.39±0.07ms     0.84  series_methods.NanOps.time_func('prod', 1000000, 'int8')
-      10.8±0.2μs      9.05±0.04μs     0.84  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 4006, datetime.timezone(datetime.timedelta(sec
onds=3600)))
-      10.7±0.3μs       8.95±0.2μs     0.83  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 1011, datetime.timezone(datetime.timedelta(sec
onds=3600)))
-      1.76±0.2ms       1.46±0.3ms     0.83  index_cached_properties.IndexCache.time_engine('MultiIndex')
-      57.4±0.4ms      34.8±0.07ms     0.61  algos.isin.IsinWithArange.time_isin(<class 'numpy.object_'>, 8000, 0)
-        93.1±1ms          400±3μs     0.00  hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 1000000)

Contributor Author

Reran time_getitem_list_like; now it swung in the other direction, so the result looks like benchmark noise rather than a real regression.

       before           after         ratio
     [dc2ae20d]       [96dd1b9e]
     <2d-object-dont-copy>       <2d-object-dont-copy~1>
-     59.7±0.08ms      3.17±0.02ms     0.05  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'unique_monotonic_inc')

@irgolic
Contributor Author

irgolic commented Jul 13, 2021

This is what it looks like now:

    if dtype is None and is_object_dtype(values.dtype):
        obj_columns = [x for x in values]
        maybe_datetime = [maybe_infer_to_datetimelike(x) for x in obj_columns]
        # don't convert (and copy) the objects if no type inference occurs
        if any(x is not y for x, y in zip(obj_columns, maybe_datetime)):
            dvals_list = [ensure_block_shape(dval, 2) for dval in maybe_datetime]
            block_values = [
                new_block(dvals_list[n], placement=n, ndim=2)
                for n in range(len(dvals_list))
            ]
        else:
            nb = new_block(values, placement=slice(len(columns)), ndim=2)
            block_values = [nb]
    else:
        nb = new_block(values, placement=slice(len(columns)), ndim=2)
        block_values = [nb]

Still not too pretty because of the double else. We could avoid it by abstracting the boolean condition into a separate private function; however, without a walrus operator we'd need to call maybe_infer_to_datetimelike twice (unless I'm missing something). Would that be fine, or should I leave it as is?
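As an aside, the pattern at the heart of the snippet can be sketched in isolation: the inference function returns its input object unchanged when nothing was inferred, so the caller can detect "no change" with an identity check and skip the copy. In this sketch, maybe_convert and build_columns are hypothetical stand-ins, not the pandas functions:

```python
def maybe_convert(col):
    """Stand-in for maybe_infer_to_datetimelike: return a converted copy
    only when inference applies; otherwise return the input object itself,
    so the caller can detect "no change" with `is`."""
    if col and all(isinstance(x, str) and x.isdigit() for x in col):
        return [int(x) for x in col]
    return col

def build_columns(columns):
    converted = [maybe_convert(c) for c in columns]
    if any(new is not old for new, old in zip(converted, columns)):
        return converted  # inference changed something: per-column result
    return columns        # nothing inferred: reuse the original, no copy

plain = [["a", "b"], ["c", "d"]]
mixed = [["1", "2"], ["c", "d"]]
print(build_columns(plain) is plain)   # True: no conversion, no copy
print(build_columns(mixed) is not mixed)  # True: first column converted
```

Calling maybe_convert once and comparing identities afterwards is what lets the real code avoid the double call the comment above worries about.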

@jbrockmendel
Member

Still not too pretty because of the double else

im not bothered by this at all
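For reference, the behavior the PR targets can be probed from user code with np.shares_memory. A minimal sketch; whether the frame actually reuses the array's memory depends on the pandas version and copy-on-write settings, so that outcome is printed rather than asserted:

```python
import numpy as np
import pandas as pd

# A 2D object ndarray with no datetime-like content, so no inference
# applies and the constructor has no reason to convert the columns.
arr = np.array([[1, "b"], [3.5, None]], dtype=object)
df = pd.DataFrame(arr)

print(df.dtypes.tolist())  # both columns keep dtype object
# Whether the block reuses arr's memory depends on the pandas version
# and copy-on-write settings, so just report it:
print(np.shares_memory(arr, df.to_numpy()))
```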

@jreback jreback added this to the 1.4 milestone Jul 14, 2021
@jreback jreback merged commit 830e9b1 into pandas-dev:master Jul 15, 2021
@jreback
Contributor

jreback commented Jul 15, 2021

thanks @irgolic thanks for sticking with it! very nice!

Labels
Internals Related to non-user accessible pandas implementation Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Development

Successfully merging this pull request may close these issues.

BUG: 2D ndarray of dtype 'object' is always copied upon construction
5 participants