
BUG: 2D ndarray of dtype 'object' is always copied upon construction #39272

Merged
merged 20 commits into from
Jul 15, 2021

Conversation

irgolic
Contributor

@irgolic irgolic commented Jan 19, 2021

This change is aesthetically unpleasing; it indents too far. #26825 already suggested separating this part into a function. Please let me know what exactly I should abstract into a function, and where to put it. Alternatively, feel free to amend this PR yourself.

Thanks for pandas. :)

@irgolic irgolic force-pushed the 2d-object-dont-copy branch 6 times, most recently from 2ed6a20 to 40b0cea Compare January 19, 2021 22:22
Contributor

@jreback jreback left a comment

you will have to measure whether performance changes in any way. run the asv suite and see which benchmarks are affected.

I am not sure how this change actually helps anything. e.g. how does sharing memory actually help here? (because the second you do almost anything these blocks will likely get split apart and copied again).

so having a specific benchmark is really important here. sharing memory is not that interesting as it's fleeting.
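A dedicated asv-style benchmark for this construction case might look like the sketch below (hypothetical; the class and method names are illustrative, not from the PR — asv times any method whose name starts with `time_`):

```python
import numpy as np
import pandas as pd

class FromObjectArray:
    """Hypothetical asv benchmark: DataFrame construction from a 2D object array."""

    def setup(self):
        # a moderately sized 2D object ndarray, the case this PR avoids copying
        self.arr = np.full((100, 100), "x", dtype=object)

    def time_constructor(self):
        pd.DataFrame(self.arr)
```

Running `asv continuous` before and after the change would then show whether skipping the copy is measurable.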

@@ -231,6 +231,7 @@ Datetimelike
- Bug in :meth:`DatetimeIndex.intersection`, :meth:`DatetimeIndex.symmetric_difference`, :meth:`PeriodIndex.intersection`, :meth:`PeriodIndex.symmetric_difference` always returning object-dtype when operating with :class:`CategoricalIndex` (:issue:`38741`)
- Bug in :meth:`Series.where` incorrectly casting ``datetime64`` values to ``int64`` (:issue:`37682`)
- Bug in :class:`Categorical` incorrectly typecasting ``datetime`` object to ``Timestamp`` (:issue:`38878`)
- Bug in :func:`pandas.core.internals.construction.init_ndarray` unnecessarily copying all object arrays after datetime inference
Contributor

we don't add a whatsnew note for internal functions; rather you want to note what a user would care about. so you can say roughly the same thing, but it's about the DataFrame constructor.

Contributor

add the issue number

make_block(dvals_list[n], placement=[n], ndim=2)
for n in range(len(dvals_list))
]
i = 0
Contributor

yeah this is not great. we already do this type of dtype grouping inside the block manager constructors; much better to do it there. However this might be non-trivial and/or might have a big performance impact.

so i would try to re-use one of the existing grouping routines here instead.

Contributor Author

Could you point me to the module?

Member

I don't think we have anything quite like this, in part because I wrote something similar in #33597

Member

what if you don't do the make_block calls here, but instead just construct the arrays, and then instead of calling create_block_manager_from_blocks at the end you could call create_block_manager_from_arrays?

Contributor Author

That solves it but has a ripple effect of making other stuff fail. That constructor might need to be adjusted for use in instantiating a DataFrame.
[screenshot: failing test results, 2021-01-20]

Member

can you post one of the actual tracebacks?

Contributor Author

@irgolic irgolic Jan 20, 2021

Some of these were due to a missing ValueError in _form_blocks, but it still ended up copying everything – that's just how create_block_manager_from_arrays is implemented. I've implemented a new _from_array constructor, which falls back to _form_blocks in the case of a datetime conversion, but otherwise doesn't copy.

Contributor Author

I tried to avoid adding a kwarg to the constructor, but it seemed like that would get messier.

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Internals Related to non-user accessible pandas implementation labels Jan 19, 2021
@jreback
Contributor

jreback commented Jan 19, 2021

cc @jbrockmendel

@jbrockmendel
Member

side note, after #38939 we can also avoid consolidating in this case, which would avoid making copies

)
else:
block_values.extend(
make_block(
Member

what happens here if we have two consecutive datetime64tz columns?

Contributor Author

In[5]: block_values
Out[5]: 
[DatetimeTZBlock: slice(0, 1, 1), 1 x 1, dtype: datetime64[ns, UTC],
 DatetimeTZBlock: slice(1, 2, 1), 1 x 1, dtype: datetime64[ns, UTC],
 DatetimeTZBlock: slice(2, 3, 1), 1 x 1, dtype: datetime64[ns, UTC]]
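The output above shows one block per column: tz-aware datetime columns use an extension dtype and are not consolidated into a single block. A minimal user-level sketch of the same situation:

```python
import pandas as pd

# two consecutive tz-aware columns; datetime64[ns, tz] is an extension
# dtype, so each column stays in its own (non-consolidated) block
df = pd.DataFrame(
    {
        "a": pd.to_datetime(["2021-01-01"]).tz_localize("UTC"),
        "b": pd.to_datetime(["2021-01-02"]).tz_localize("UTC"),
    }
)
print(df.dtypes)
```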

@irgolic
Contributor Author

irgolic commented Jan 20, 2021

From what I can gather, the problem is that not all (but some) datetimelike types are dtypes. Checking equality between a dtype and a non-dtype throws an exception, which is likely why this wasn't implemented in the block manager constructor (where I assume everything works in dtypes?)

Edit: ignore this comment

@jbrockmendel
Member

Checking equality between a dtype and a non-dtype throws an exception

I'm guessing the issue is trying to do some_numpy_dtype == some_extension_dtype, which does raise. If you use `is_dtype_equal(some_numpy_dtype, some_extension_dtype)` you'll get a bool back
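For illustration, a minimal sketch of the suggested comparison using `pandas.api.types.is_dtype_equal`, which always returns a plain bool for mixed numpy/extension dtype pairs:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_dtype_equal

np_dtype = np.dtype("int64")
ext_dtype = pd.CategoricalDtype()

# works for numpy vs numpy as well as numpy vs extension dtypes,
# and never raises on a mixed pair
print(is_dtype_equal(np_dtype, np.dtype("int64")))  # True
print(is_dtype_equal(np_dtype, ext_dtype))          # False
```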

@irgolic irgolic force-pushed the 2d-object-dont-copy branch 3 times, most recently from 560f894 to 322a677 Compare January 20, 2021 17:20
@pep8speaks

pep8speaks commented Jan 20, 2021

Hello @irgolic! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-07-14 18:45:43 UTC

@irgolic irgolic force-pushed the 2d-object-dont-copy branch 3 times, most recently from 57d8c6b to 1d95a25 Compare January 20, 2021 18:20
@irgolic
Copy link
Contributor Author

irgolic commented Jan 20, 2021

I am not sure how this change actually helps anything. e.g. how does sharing memory actually help here? (because the second you do almost anything these blocks will likely get split apart and copied again).

I'm a core developer of https://github.com/biolab/orange3, focusing on UX. Orange's data objects are also based on numpy arrays; Orange's Table exposes them as table.X, table.Y, table.metas. I'm trying to wrap our numpy arrays in DataFrame objects and allow people to use pandas via accessors on Table (table.X_df, table.Y_df, table.metas_df) (biolab/orange3#5189).

While you're right that most operations will split it apart anyway, in my use case it'll avoid an unnecessary copy operation.
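Concretely, the use case looks something like the sketch below (the `metas` array is a hypothetical stand-in for Orange's data; whether memory is actually shared depends on the pandas version and this PR's fix):

```python
import numpy as np
import pandas as pd

# stand-in for Orange's table.metas: a 2D object ndarray
metas = np.array([["a", 1.0], ["b", 2.0]], dtype=object)

df = pd.DataFrame(metas)
# with the fix, the constructor can wrap the array instead of copying it
print(np.shares_memory(df.values, metas))
```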

you will have to measure performance if it changes in any way. run the asv suite. and see which benchmarks are affected.

On it. If none of the benchmarks change, I'll add one.

@irgolic irgolic force-pushed the 2d-object-dont-copy branch 4 times, most recently from cd2281e to 4ef2e7e Compare January 21, 2021 04:26
@irgolic
Copy link
Contributor Author

irgolic commented Jan 21, 2021

Let me know if you prefer to change create_block_manager_from_arrays, as is done in the latest commit, and I'll fix it up.

As the function is used elsewhere too, it makes a test fail:

def test_constructor_series_copy(self, float_frame):
    series = float_frame._series
  
    df = DataFrame({"A": series["A"]})
  
    df["A"][:] = 5
  
    assert not (series["A"] == 5).all()  # False

So, create_block_manager_from_arrays won't create a copy of the series, because it's unnecessary to do so. There's only one test in pandas relying on this behavior; I'm not sure how much code around the world relies on it.
If the DataFrame contains anything more than a single series/array, it'll copy. This is how I adjusted the test too:

def test_constructor_series_copy(self, float_frame):
    series = float_frame._series
  
    df = DataFrame({"A": series["A"], "B": series["A"]})
  
    df["A"][:] = 5
  
    assert not (series["A"] == 5).all()  # True
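For callers who depend on the old defensive-copy behavior, passing an explicit copy sidesteps the question entirely; a sketch (not part of the PR):

```python
import pandas as pd

series = pd.Series([1.0, 2.0, 3.0], name="A")

# an explicit .copy() guarantees the source Series is never mutated,
# regardless of the constructor's internal copy behavior
df = pd.DataFrame({"A": series.copy()})
df["A"] = 5

assert not (series == 5).any()  # source is untouched
```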

names_idx = names
if names_idx.equals(axes[0]):
names_indexer = np.arange(len(names_idx))
if len(arrays) == 1:
Contributor

this for sure needs comments, and even better please separate this out into a new function.

Contributor Author

Better?

Contributor

no, this looks almost like a groupby to me. i am puzzled why the axes are involved here at all.

Contributor Author

@irgolic irgolic Jan 22, 2021

That's how it was already written. It finds all variables of the same block type, and makes blocks out of them. If you pass in more arrays than names, only those arrays whose names were specified are blockified.

I could rewrite this into a groupby, I guess. It'll create more contiguous blocks than it does currently.

If you look at create_block_manager_from_blocks, you'll see that it allows passing a single array (not a block), and it'll create a contiguous block for it. I could do this in create_block_manager_from_arrays, but IIRC that made a lot of things fail, like init_dict, which uses the same constructor. Maybe #38939 will fix that.

I still think making a new constructor would be more elegant. It would allow removal of this weird bit in create_block_manager_from_blocks too:

def create_block_manager_from_blocks(blocks, axes: List[Index]) -> BlockManager:
    try:
        if len(blocks) == 1 and not isinstance(blocks[0], Block):
            # if blocks[0] is of length 0, return empty blocks
            if not len(blocks[0]):
                blocks = []
            else:
                # It's OK if a single block is passed as values, its placement
                # is basically "all items", but if there're many, don't bother
                # converting, it's an error anyway.
                blocks = [
                    make_block(
                        values=blocks[0], placement=slice(0, len(axes[0])), ndim=2
                    )
                ]

        ...

It's unintuitive to pass something that's not a block to a method suffixed with _from_blocks.


if len(arrays) == 1:
Contributor

add comments

doc/source/whatsnew/v1.3.1.rst (resolved)
@@ -1845,6 +1845,14 @@ def test_constructor_series_copy(self, float_frame):

assert not (series["A"] == 5).all()

def test_object_array_does_not_copy(self):
a = np.array(["a", "b"], dtype="object")
Contributor

add the issue number as a comment

assert np.shares_memory(df.values, a)
df2 = DataFrame(b)
assert np.shares_memory(df2.values, b)

Contributor

assert the frames are equal as well

Contributor Author

did you mean assert that the numpy arrays are equal? They're not equal, but they do share memory:

>       np.testing.assert_array_equal(df.values, a)
E       AssertionError: 
E       Arrays are not equal
E       
E       (shapes (2, 1), (2,) mismatch)
E        x: array([['a'],
E                  ['b']], dtype=object)
E        y: array(['a', 'b'], dtype=object)

(note also I've split the test into two)
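For reference, the shape mismatch above can be worked around by comparing against a reshaped view; a sketch (not the test that landed in the PR):

```python
import numpy as np
import pandas as pd

a = np.array(["a", "b"], dtype=object)
df = pd.DataFrame(a)

# df.values is 2D with shape (2, 1) while a is 1D with shape (2,),
# so compare against a reshaped (zero-copy) view of a
np.testing.assert_array_equal(df.values, a.reshape(-1, 1))
```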

pandas/core/internals/construction.py (resolved)
@irgolic irgolic force-pushed the 2d-object-dont-copy branch 5 times, most recently from c0e144f to d271764 Compare July 11, 2021 17:47
@irgolic
Copy link
Contributor Author

irgolic commented Jul 11, 2021

I finally figured out what was failing. It was the new ArrayManager. I just needed to add @td.skip_array_manager_invalid_test on the test.

The way the new array data management is set up currently is such that it copies all incoming numpy arrays. Our use case for pandas relies on 'wrapping' numpy arrays without copying them. Please consider this when enforcing the new array data manager. Although I'm assuming that this won't be standard for a while.

Until then, please accept this contribution, which fixes the 2D object array copy bug in the block manager. Note, I've also refactored that bit and made a create_block_manager_from_array method instead of _from_blocks. create_block_manager_from_blocks is now not used anywhere in the pandas codebase, but dask calls this pandas internal, so it should probably be kept for now.
@jreback Sorry for bouncing back and forth between the two implementations. I think this one's cleaner, let me know if I should change it back or amend it.

(also not sure why it says pandas-dev.pandas failed, all the tests passed)

@irgolic irgolic requested a review from jreback July 11, 2021 20:20
@jbrockmendel
Member

The way the new array data management is set up currently is such that it copies all incoming numpy arrays. Our use case for pandas relies on 'wrapping' numpy arrays without copying them. Please consider this when enforcing the new array data manager. Although I'm assuming that this won't be standard for a while.

IIRC there was some discussion about this copy being avoided, not sure what happened to that. Opening an issue about it might be worthwhile.

# don't convert (and copy) the objects if no type inference occurs
if any(
not is_dtype_equal(instance.dtype, array.dtype)
for instance in maybe_datetime
Member

double-check me on this, but i think a more performant alternative would be

obj_columns = [x for x in array]
maybe_datetime = [maybe_infer_to_datetimelike(x) for x in obj_columns]
if any(x is not y for x, y in zip(obj_columns, maybe_datetime)):
    [...]
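The idea behind the suggestion: a converter that returns its input unchanged when no conversion applies lets you detect conversion with a cheap identity check instead of dtype comparisons. A self-contained sketch with a hypothetical stand-in converter (pandas uses maybe_infer_to_datetimelike here):

```python
def maybe_convert(col):
    # hypothetical stand-in: returns the input object unchanged
    # when no conversion applies (as maybe_infer_to_datetimelike does)
    return col

cols = [[1, 2], ["a", "b"]]
converted = [maybe_convert(c) for c in cols]

# per-column identity comparison is O(1) and involves no dtype machinery;
# it is True only for columns the converter actually replaced
any_converted = any(x is not y for x, y in zip(cols, converted))
print(any_converted)  # False: nothing was converted
```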

Contributor Author

I'm running asv to see if there's a performance difference

Contributor Author

@irgolic irgolic Jul 14, 2021

Here's the full asv run comparing the last commit (your suggestion) and the second to last commit. Any of these in particular you'd like me to rerun?

       before           after         ratio                                                                                                                  
     [dc2ae20d]       [96dd1b9e]                                                                                                                     
     <2d-object-dont-copy>       <2d-object-dont-copy~1>                                                                                                     
+     4.43±0.01ms       94.2±0.3ms    21.27  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
+        59.0±3ms       75.8±0.4ms     1.28  inference.ToTimedeltaErrors.time_convert('coerce')                                                              
+        45.8±5ms         58.5±5ms     1.28  gil.ParallelGroupbyMethods.time_loop(2, 'last')                                                                 
+      4.37±0.2ms       5.34±0.3ms     1.22  rolling.Engine.time_rolling_apply('Series', 'float', <function sum>, 'cython', 'max')                           
+      5.31±0.1ms       6.45±0.5ms     1.21  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function Engine.<lambda>>, 'cython', 'sum')            
+      4.46±0.1ms       5.40±0.3ms     1.21  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function sum>, 'cython', 'mean')                       
+      5.38±0.2ms       6.51±0.6ms     1.21  rolling.Engine.time_rolling_apply('DataFrame', 'int', <function Engine.<lambda>>, 'cython', 'max')              
+         147±9ms          178±1ms     1.21  inference.ToDatetimeISO8601.time_iso8601_tz_spaceformat                                                         
+        72.1±5ms         86.7±1ms     1.20  inference.ToDatetimeCache.time_dup_string_tzoffset_dates(False)                                                 
+     4.57±0.09ms       5.47±0.3ms     1.20  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function sum>, 'cython', 'max')                        
+      5.23±0.2ms       6.23±0.4ms     1.19  rolling.Engine.time_rolling_apply('Series', 'float', <function Engine.<lambda>>, 'cython', 'mean')              
+      5.40±0.1ms       6.42±0.5ms     1.19  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function Engine.<lambda>>, 'cython', 'median')         
+      4.29±0.1ms       5.10±0.3ms     1.19  rolling.Engine.time_rolling_apply('Series', 'int', <function sum>, 'cython', 'mean')                            
+      4.52±0.2ms       5.37±0.2ms     1.19  rolling.Engine.time_rolling_apply('DataFrame', 'int', <function sum>, 'cython', 'median')                       
+      4.45±0.2ms       5.28±0.4ms     1.19  rolling.Engine.time_rolling_apply('Series', 'float', <function sum>, 'cython', 'mean')                          
+      15.4±0.6μs       18.2±0.3μs     1.19  series_methods.NanOps.time_func('sum', 1000, 'boolean')                                                         
+      4.39±0.1ms       5.20±0.2ms     1.19  rolling.Engine.time_rolling_apply('Series', 'int', <function sum>, 'cython', 'max')                             
+      5.25±0.2ms       6.23±0.4ms     1.18  rolling.Engine.time_rolling_apply('Series', 'int', <function Engine.<lambda>>, 'cython', 'median')              
+      5.20±0.2ms       6.16±0.3ms     1.18  rolling.Engine.time_rolling_apply('Series', 'float', <function Engine.<lambda>>, 'cython', 'sum')               
+      4.37±0.2ms       5.17±0.3ms     1.18  rolling.Engine.time_rolling_apply('Series', 'int', <function sum>, 'cython', 'median')                          
+     4.53±0.04ms       5.35±0.3ms     1.18  rolling.Engine.time_rolling_apply('DataFrame', 'int', <function sum>, 'cython', 'min')                          
+      4.35±0.1ms       5.13±0.3ms     1.18  rolling.Engine.time_rolling_apply('Series', 'int', <function sum>, 'cython', 'sum')                             
+      4.40±0.1ms       5.18±0.3ms     1.18  rolling.Engine.time_rolling_apply('Series', 'float', <function sum>, 'cython', 'median')                        
+      5.25±0.2ms       6.17±0.2ms     1.17  rolling.Engine.time_rolling_apply('Series', 'float', <function Engine.<lambda>>, 'cython', 'min')               
+      5.39±0.2ms       6.32±0.4ms     1.17  rolling.Engine.time_rolling_apply('DataFrame', 'int', <function Engine.<lambda>>, 'cython', 'min')              
+        58.8±2ms         68.9±7ms     1.17  inference.ToTimedeltaErrors.time_convert('ignore')                                                              
+      4.42±0.2ms       5.17±0.2ms     1.17  rolling.Engine.time_rolling_apply('Series', 'float', <function sum>, 'cython', 'min')                           
+      5.33±0.1ms       6.23±0.2ms     1.17  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function Engine.<lambda>>, 'cython', 'min')            
+      5.38±0.3ms       6.27±0.4ms     1.17  rolling.Engine.time_rolling_apply('Series', 'int', <function Engine.<lambda>>, 'cython', 'max')                 
+      4.73±0.3ms       5.51±0.4ms     1.17  rolling.Engine.time_expanding_apply('Series', 'int', <function sum>, 'cython', 'max')                           
+      5.20±0.1ms       6.06±0.3ms     1.16  rolling.Engine.time_rolling_apply('Series', 'int', <function Engine.<lambda>>, 'cython', 'min')                 
+      4.60±0.2ms       5.36±0.3ms     1.16  rolling.Engine.time_rolling_apply('DataFrame', 'int', <function sum>, 'cython', 'max')                          
+      13.2±0.4μs       15.3±0.3μs     1.16  series_methods.NanOps.time_func('max', 1000, 'boolean')                                                         
+     4.40±0.06ms       5.11±0.4ms     1.16  rolling.Engine.time_rolling_apply('Series', 'float', <function sum>, 'cython', 'sum')                           
+      4.57±0.1ms       5.29±0.3ms     1.16  rolling.Engine.time_rolling_apply('DataFrame', 'int', <function sum>, 'cython', 'sum')                          
+     4.67±0.09ms       5.41±0.2ms     1.16  rolling.Engine.time_rolling_apply('DataFrame', 'int', <function sum>, 'cython', 'mean')                         
+      5.39±0.1ms       6.23±0.5ms     1.16  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function Engine.<lambda>>, 'cython', 'mean')           
+      5.21±0.1ms       6.02±0.4ms     1.16  rolling.Engine.time_rolling_apply('Series', 'float', <function Engine.<lambda>>, 'cython', 'max')               
+     4.59±0.03ms       5.29±0.2ms     1.15  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function sum>, 'cython', 'median')                     
+      5.31±0.1ms       6.12±0.3ms     1.15  rolling.Engine.time_rolling_apply('DataFrame', 'int', <function Engine.<lambda>>, 'cython', 'mean')             
+     4.55±0.03ms       5.24±0.3ms     1.15  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function sum>, 'cython', 'sum')                        
+      5.37±0.1ms       6.19±0.4ms     1.15  rolling.Engine.time_rolling_apply('DataFrame', 'int', <function Engine.<lambda>>, 'cython', 'sum')              
+      16.4±0.7μs         18.7±1μs     1.14  series_methods.SearchSorted.time_searchsorted('uint8')                                                          
+      5.26±0.1ms       5.99±0.3ms     1.14  rolling.Engine.time_rolling_apply('Series', 'float', <function Engine.<lambda>>, 'cython', 'median')            
+      4.64±0.1ms       5.29±0.3ms     1.14  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function sum>, 'cython', 'min')                        
+     1.17±0.01ms      1.33±0.07ms     1.14  hash_functions.NumericSeriesIndexingShuffled.time_loc_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 500000)
+     2.58±0.03μs      2.93±0.08μs     1.13  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.Float64Engine'>, <class 'numpy.float64'>), 'monotonic_incr')
+      5.21±0.2ms       5.91±0.3ms     1.13  rolling.Engine.time_rolling_apply('Series', 'int', <function Engine.<lambda>>, 'cython', 'mean')
+        4.28±0ms       4.84±0.5ms     1.13  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('time', 10000, datetime.timezone(datetime.timedelta(seconds=3600)))
+      4.51±0.2ms       5.10±0.2ms     1.13  rolling.Engine.time_rolling_apply('Series', 'int', <function sum>, 'cython', 'min')                             
+      5.50±0.2ms       6.21±0.2ms     1.13  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function Engine.<lambda>>, 'cython', 'max')            
+      16.5±0.4μs         18.6±1μs     1.13  series_methods.SearchSorted.time_searchsorted('int16')                                                          
+     2.57±0.01μs       2.89±0.2μs     1.13  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.Int64Engine'>, <class 'numpy.int64'>), 'monotonic_incr')
+       452±0.8ms         508±50ms     1.12  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('time', 1000000, datetime.timezone(datetime.timedelta(seconds=3600)))
+        806±20μs         904±40μs     1.12  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function ne>)   
+     3.20±0.04μs       3.59±0.2μs     1.12  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(0, 'start', 'month', 'QS', 3)                       
+      13.2±0.5μs       14.8±0.6μs     1.12  series_methods.NanOps.time_func('min', 1000, 'boolean')                                                         
+         121±5μs          136±4μs     1.12  multiindex_object.GetLoc.time_large_get_loc                                                                     
+      47.6±0.2ms         53.0±1ms     1.11  rolling.TableMethod.time_apply('single')                                                                        
+     4.17±0.02μs       4.64±0.2μs     1.11  categoricals.CategoricalSlicing.time_getitem_scalar('non_monotonic')                                            
+      15.9±0.3ms       17.7±0.2ms     1.11  inference.ToTimedelta.time_convert_string_days                                                                  
+        741±20μs         824±20μs     1.11  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function eq>)       
+        732±20μs         813±20μs     1.11  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function le>)   
+     3.32±0.03μs       3.67±0.2μs     1.11  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(0, 'start', 'year', 'B', 5)                         
+     3.44±0.02μs       3.81±0.3μs     1.11  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(1, 'start', 'year', 'B', 5)                         
+       113±0.7μs          125±1μs     1.11  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'unique_monotonic_inc')
+     4.16±0.02μs       4.60±0.4μs     1.10  categoricals.CategoricalSlicing.time_getitem_scalar('monotonic_incr')                                           
+      13.0±0.2μs       14.3±0.2μs     1.10  series_methods.NanOps.time_func('sum', 1000, 'Int64')                                                           
+        60.3±1ms         66.6±1ms     1.10  rolling.Groupby.time_rolling_int('mean')                                                                        
+      14.4±0.1μs       15.9±0.5μs     1.10  series_methods.NanOps.time_func('min', 1000, 'Int64')                                                           
+        60.7±1ms         67.0±1ms     1.10  rolling.Groupby.time_rolling_int('max')                                                                         
+        61.5±1ms         67.9±1ms     1.10  rolling.Groupby.time_rolling_int('median')                                                                      
+        544±20ns         600±30ns     1.10  index_cached_properties.IndexCache.time_inferred_type('Int64Index')                                             
+     15.8±0.04μs       17.4±0.8μs     1.10  inference.MaybeConvertObjects.time_maybe_convert_objects                                                        
+       437±0.7ms         482±40ms     1.10  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('time', 1000000, None)                                
+        60.5±2ms         66.6±1ms     1.10  rolling.Groupby.time_rolling_int('sum' (0))                                                                     
+     3.36±0.01μs       3.70±0.2μs     1.10  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(0, 'end', 'year', 'B', 12)                          
+     3.24±0.04μs       3.57±0.2μs     1.10  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(0, 'start', 'quarter', 'QS', 5)                     
+         142±2μs          157±1μs     1.10  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+      41.1±0.3ms         45.2±1ms     1.10  frame_methods.Isnull.time_isnull_obj                                                                            
+     3.47±0.02μs       3.82±0.2μs     1.10  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(1, 'end', 'quarter', 'QS', 12)                      
-      8.68±0.5μs      7.89±0.02μs     0.91  tslibs.resolution.TimeResolution.time_get_resolution('h', 0, datetime.timezone(datetime.timedelta(seconds=3600)))
-      8.79±0.6μs      7.98±0.04μs     0.91  tslibs.resolution.TimeResolution.time_get_resolution('D', 1, datetime.timezone(datetime.timedelta(seconds=3600)))
-        18.0±1μs      16.3±0.05μs     0.91  tslibs.resolution.TimeResolution.time_get_resolution('ns', 1, tzlocal())
-      9.93±0.6μs      9.01±0.02μs     0.91  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 4000, datetime.timezone(datetime.timedelta(seconds=3600)))
-      1.10±0.09s          998±5ms     0.91  arithmetic.OffsetArrayArithmetic.time_add_series_offset(<CustomBusinessMonthEnd> (0))
-      2.71±0.2μs         2.46±0μs     0.91  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('date', 0, None)
-      9.98±0.7μs      9.05±0.03μs     0.91  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 2000, datetime.timezone(datetime.timedelta(sec
onds=3600)))
-      2.75±0.2μs      2.50±0.02μs     0.91  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('time', 0, tzlocal())
-      2.72±0.2μs         2.46±0μs     0.91  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('timestamp', 0, None)
-      9.98±0.5μs      9.03±0.09μs     0.91  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 5000, datetime.timezone(datetime.timedelta(sec
onds=3600)))

-      9.11±0.3μs       8.24±0.3μs     0.90  tslibs.timestamp.TimestampProperties.time_weekday_name(tzlocal(), None)
-      9.91±0.6μs       8.96±0.1μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 12000, datetime.timezone(datetime.timedelta(se
conds=3600)))
-      10.0±0.6μs      9.06±0.04μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 3000, datetime.timezone(datetime.timedelta(sec
onds=3600)))
-      10.1±0.8μs      9.09±0.03μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 8000, datetime.timezone(datetime.timedelta(sec
onds=3600)))
-      9.94±0.7μs       8.99±0.1μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 10000, datetime.timezone(datetime.timedelta(se
conds=3600)))
-        19.8±1μs      17.9±0.05μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(100, 2000, <DstTzInfo 'US/Pacific' LMT-1 day, 16:
07:00 STD>)
-      1.11±0.09s          998±3ms     0.90  arithmetic.OffsetArrayArithmetic.time_add_dti_offset(<CustomBusinessMonthEnd> (0))
-      9.91±0.6μs      8.95±0.08μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 7000, datetime.timezone(datetime.timedelta(sec
onds=3600)))
-      9.99±0.5μs      9.01±0.09μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 1000, datetime.timezone(datetime.timedelta(sec
onds=3600)))
-      2.58±0.2μs      2.32±0.01μs     0.90  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('time', 0, datetime.timezone.utc)
-      9.97±0.7μs      8.96±0.06μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 11000, datetime.timezone(datetime.timedelta(se
conds=3600)))
-      29.3±0.3ms       26.2±0.5ms     0.90  rolling.Apply.time_rolling('Series', 3, 'int', <built-in function sum>, False)
-      10.1±0.7μs      9.03±0.06μs     0.89  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 9000, datetime.timezone(datetime.timedelta(sec
onds=3600)))
-      2.60±0.2μs      2.32±0.01μs     0.89  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('timestamp', 0, datetime.timezone.utc)
-      10.0±0.7μs      8.98±0.04μs     0.89  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 6000, datetime.timezone(datetime.timedelta(sec
onds=3600)))
-      8.65±0.2μs       7.73±0.1μs     0.89  tslibs.period.TimePeriodArrToDT64Arr.time_periodarray_to_dt64arr(1, 7000)
-      9.99±0.7μs      8.93±0.05μs     0.89  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 2011, datetime.timezone(datetime.timedelta(sec
onds=3600)))
-        26.7±2μs       23.8±0.2μs     0.89  tslibs.timestamp.TimestampOps.time_normalize(tzlocal())
-      2.77±0.2μs      2.47±0.02μs     0.89  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('time', 0, None)
-      6.29±0.6μs      5.60±0.08μs     0.89  tslibs.timedelta.TimedeltaConstructor.time_from_unit
-      11.6±0.8μs       10.3±0.1μs     0.89  tslibs.timestamp.TimestampOps.time_replace_None(datetime.timezone(datetime.timedelta(seconds=3600)))
-      10.1±0.6μs      8.95±0.06μs     0.89  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 10000, datetime.timezone(datetime.timedelta(se
conds=3600)))
-      5.64±0.3μs      5.00±0.04μs     0.89  tslibs.timestamp.TimestampOps.time_normalize(None)
-        58.8±3μs       51.9±0.6μs     0.88  tslibs.timestamp.TimestampOps.time_normalize(tzfile('/usr/share/zoneinfo/Asia/Tokyo'))
-      3.19±0.3μs      2.80±0.02μs     0.88  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('date', 1, None)
-        9.72±1μs      8.54±0.05μs     0.88  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('timestamp', 0, datetime.timezone(datetime.timedelta(s
econds=3600)))
-      2.85±0.2μs      2.50±0.01μs     0.88  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('timestamp', 0, tzlocal())
-     1.27±0.03ms      1.11±0.06ms     0.87  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function mul>)
-        15.5±2μs       13.6±0.2μs     0.87  tslibs.timedelta.TimedeltaConstructor.time_from_components
-        8.74±1μs      7.57±0.05μs     0.87  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 6000, <DstTzInfo 'US/Pacific' LMT-1 day, 16:07
:00 STD>)
-      8.05±0.8μs      6.92±0.03μs     0.86  tslibs.timestamp.TimestampOps.time_replace_None(<DstTzInfo 'US/Pacific' LMT-1 day, 16:07:00 STD>)
-     1.59±0.01ms      1.36±0.01ms     0.85  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(100, 4000, tzlocal())
-      6.61±0.6μs      5.65±0.08μs     0.85  tslibs.timedelta.TimedeltaConstructor.time_from_string
-        24.1±2μs       20.5±0.1μs     0.85  tslibs.timestamp.TimestampOps.time_normalize(datetime.timezone(datetime.timedelta(seconds=3600)))
-     11.7±0.05ms      9.84±0.06ms     0.84  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_mon
otonic_inc')
-      1.66±0.1ms      1.39±0.07ms     0.84  series_methods.NanOps.time_func('prod', 1000000, 'int8')
-      10.8±0.2μs      9.05±0.04μs     0.84  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 4006, datetime.timezone(datetime.timedelta(sec
onds=3600)))
-      10.7±0.3μs       8.95±0.2μs     0.83  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 1011, datetime.timezone(datetime.timedelta(sec
onds=3600)))
-      1.76±0.2ms       1.46±0.3ms     0.83  index_cached_properties.IndexCache.time_engine('MultiIndex')
-      57.4±0.4ms      34.8±0.07ms     0.61  algos.isin.IsinWithArange.time_isin(<class 'numpy.object_'>, 8000, 0)
-        93.1±1ms          400±3μs     0.00  hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 1000000)

Contributor Author

Reran time_getitem_list_like; now it swung in the other direction, so the result looks like benchmark noise rather than a real regression.

       before           after         ratio
     [dc2ae20d]       [96dd1b9e]
     <2d-object-dont-copy>       <2d-object-dont-copy~1>
-     59.7±0.08ms      3.17±0.02ms     0.05  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'unique_monotonic_inc')

@irgolic
Contributor Author

irgolic commented Jul 13, 2021

This is what it looks like now:

    if dtype is None and is_object_dtype(values.dtype):
        obj_columns = [x for x in values]
        maybe_datetime = [maybe_infer_to_datetimelike(x) for x in obj_columns]
        # don't convert (and copy) the objects if no type inference occurs
        if any(x is not y for x, y in zip(obj_columns, maybe_datetime)):
            dvals_list = [ensure_block_shape(dval, 2) for dval in maybe_datetime]
            block_values = [
                new_block(dvals_list[n], placement=n, ndim=2)
                for n in range(len(dvals_list))
            ]
        else:
            nb = new_block(values, placement=slice(len(columns)), ndim=2)
            block_values = [nb]
    else:
        nb = new_block(values, placement=slice(len(columns)), ndim=2)
        block_values = [nb]

Still not too pretty because of the double else. We could avoid it by abstracting the boolean condition into a separate private function; however, without a walrus operator we'd need to call maybe_infer_to_datetimelike twice (unless I'm missing something). Would that be fine, or should I leave it as is?
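As an aside, the pattern at the heart of the snippet can be sketched in isolation: the inference function returns its input object unchanged when nothing was inferred, so the caller can detect "no change" with an identity check and skip the copy. In this sketch, maybe_convert and build_columns are hypothetical stand-ins, not the pandas functions:

```python
def maybe_convert(col):
    """Stand-in for maybe_infer_to_datetimelike: return a converted copy
    only when inference applies; otherwise return the input object itself,
    so the caller can detect "no change" with `is`."""
    if col and all(isinstance(x, str) and x.isdigit() for x in col):
        return [int(x) for x in col]
    return col

def build_columns(columns):
    converted = [maybe_convert(c) for c in columns]
    if any(new is not old for new, old in zip(converted, columns)):
        return converted  # inference changed something: per-column result
    return columns        # nothing inferred: reuse the original, no copy

plain = [["a", "b"], ["c", "d"]]
mixed = [["1", "2"], ["c", "d"]]
print(build_columns(plain) is plain)   # True: no conversion, no copy
print(build_columns(mixed) is not mixed)  # True: first column converted
```

Calling maybe_convert once and comparing identities afterwards is what lets the real code avoid the double call the comment above worries about.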

@jbrockmendel
Member

Still not too pretty because of the double else

im not bothered by this at all
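For reference, the behavior the PR targets can be probed from user code with np.shares_memory. A minimal sketch; whether the frame actually reuses the array's memory depends on the pandas version and copy-on-write settings, so that outcome is printed rather than asserted:

```python
import numpy as np
import pandas as pd

# A 2D object ndarray with no datetime-like content, so no inference
# applies and the constructor has no reason to convert the columns.
arr = np.array([[1, "b"], [3.5, None]], dtype=object)
df = pd.DataFrame(arr)

print(df.dtypes.tolist())  # both columns keep dtype object
# Whether the block reuses arr's memory depends on the pandas version
# and copy-on-write settings, so just report it:
print(np.shares_memory(arr, df.to_numpy()))
```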

@jreback jreback added this to the 1.4 milestone Jul 14, 2021
@jreback jreback merged commit 830e9b1 into pandas-dev:master Jul 15, 2021
@jreback
Contributor

jreback commented Jul 15, 2021

thanks @irgolic thanks for sticking with it! very nice!

Labels
Internals Related to non-user accessible pandas implementation Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Development

Successfully merging this pull request may close these issues.

BUG: 2D ndarray of dtype 'object' is always copied upon construction
5 participants