DataFrame.loc/iloc fails to update object with new dtype #24269

Open
TomAugspurger opened this issue Dec 13, 2018 · 3 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@TomAugspurger
Contributor

Similar to #4312 and #5702, but this seems specific to object dtype -> new dtype.

In this first case, we likely convert the whole block, even though we just wanted A.

In [20]: df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, dtype=object)

In [21]: df.dtypes
Out[21]:
A    object
B    object
dtype: object

In [22]: df.loc[:, ['A']] = df.loc[:, ['A']].astype(int)

In [23]: df.dtypes
Out[23]:
A    int64
B    int64
dtype: object
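
Until the list-of-labels case is sorted out, one possible workaround is to replace the column through plain column assignment instead of `.loc`. This is only a sketch and not a fix for the bug itself; it assumes that `df[col] = ...` swaps in the new column rather than writing into the existing object block.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, dtype=object)

# Replace column A outright instead of setting into it with .loc.
# (Assumption: column assignment does not touch the block holding B.)
df["A"] = df["A"].astype(int)

print(df.dtypes)
```

With this, A should come back with an integer dtype while B stays object.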

In this one (maybe a different bug), we fail to convert ['a', 'b'] to float when they start out in an object block with other values.

In [13]: df = pd.DataFrame([[np.nan, np.nan, 1, pd.Timestamp('2000')]], columns=['a', 'b', 'c', 'd'], dtype=object)

In [14]: df.loc[:, ['a', 'b']] = df.loc[:, ['a', 'b']].astype(float)

In [15]: df.dtypes
Out[15]:
a    object
b    object
c    object
d    object
dtype: object
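
For comparison, a sketch of getting the intended result without going through `.loc` setitem at all: `DataFrame.astype` accepts a column -> dtype mapping and returns a new frame. This sidesteps the bug rather than explaining it, and the name `converted` is just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[np.nan, np.nan, 1, pd.Timestamp("2000")]],
    columns=["a", "b", "c", "d"],
    dtype=object,
)

# astype with a column -> dtype mapping returns a new frame;
# columns that are not listed keep their current dtype.
converted = df.astype({"a": float, "b": float})

print(converted.dtypes)
```

Here a and b should be float64 in `converted`, while c and d remain object.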
@TomAugspurger TomAugspurger added Dtype Conversions Unexpected or buggy dtype conversions Difficulty Intermediate labels Dec 13, 2018
@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Dec 13, 2018
@slnguyen

Something that might be useful to note is that when indexing with a single label we get the expected output. For example:

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, dtype=object)
df.loc[:, 'A'] = (df.loc[:, 'A']).astype(int)

df.dtypes returns the expected:

A     int64
B    object

However as you mentioned before:

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, dtype=object)
df.loc[:, ['A']] = (df.loc[:, ['A']]).astype(int)

df.dtypes returns

A    int64
B    int64
dtype: object

So the issue might have something to do with casting and indexing the dataframe with a list of labels versus a single label?
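
To make the two cases easier to compare side by side, here is a small sketch that runs the same assignment with a single label and with a list of labels. The helper name `dtypes_after_loc_set` is made up for this example, and the output is deliberately not shown because it depends on the pandas version.

```python
import pandas as pd


def dtypes_after_loc_set(key):
    """Return the column dtypes after assigning an int cast via df.loc[:, key]."""
    df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, dtype=object)
    df.loc[:, key] = df.loc[:, key].astype(int)
    return {col: str(dtype) for col, dtype in df.dtypes.items()}


print("pandas", pd.__version__)
print("single label  :", dtypes_after_loc_set("A"))
print("list of labels:", dtypes_after_loc_set(["A"]))
```

If the two printed dicts differ, the single-label and list-of-labels paths are being handled differently on that version.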

@slnguyen

slnguyen commented Dec 21, 2018

I'm not sure this is the right way to resolve the problem, but I've added another control statement so that updating DataFrame values for a list of labels is handled the same way as updating values for a single label. Running the benchmarks in asv_bench/benchmarks/indexing.py shows no significant change in performance.

if isinstance(indexer, tuple):
    indexer = maybe_convert_ix(*indexer)
    # if we are setting on the info axis ONLY
    # set using those methods to avoid block-splitting
    # logic here
    if (len(indexer) > info_axis and
            is_integer(indexer[info_axis]) and
            all(com.is_null_slice(idx)
                for i, idx in enumerate(indexer)
                if i != info_axis) and
            item_labels.is_unique):
        self.obj[item_labels[indexer[info_axis]]] = value
        return
if isinstance(value, (ABCSeries, dict)):
    # TODO(EA): ExtensionBlock.setitem this causes issues with
    # setting for extensionarrays that store dicts. Need to decide
    # if it's worth supporting that.
    value = self._align_series(indexer, Series(value))
elif isinstance(value, ABCDataFrame):
    value = self._align_frame(indexer, value)
if isinstance(value, ABCPanel):
    value = self._align_panel(indexer, value)
# check for chained assignment
self.obj._check_is_chained_assignment_possible()
# actually do the set
self.obj._consolidate_inplace()
self.obj._data = self.obj._data.setitem(indexer=indexer,
                                        value=value)
self.obj._maybe_update_cacher(clear=True)

Changes were made based on my previous comment. When indexing via a single label, the code goes through lines 624-631 (see the code snippet above). When indexing via a list of labels, the code goes through lines 649-652.
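
A rough standalone model of the early-return condition may help show why the two cases diverge. This is not the pandas code: `takes_single_column_path` and `is_null_slice` are hypothetical stand-ins written for this comment, and it assumes the label 'A' reaches this point as the integer position 0 while ['A'] arrives as a list-like of positions.

```python
from numbers import Integral


def is_null_slice(obj):
    """True for a bare ':' slice, i.e. slice(None, None, None)."""
    return (
        isinstance(obj, slice)
        and obj.start is None
        and obj.stop is None
        and obj.step is None
    )


def takes_single_column_path(indexer, info_axis=1, labels_unique=True):
    """Simplified stand-in for the early-return condition in the snippet."""
    if not isinstance(indexer, tuple) or len(indexer) <= info_axis:
        return False
    col = indexer[info_axis]
    # A scalar integer position on the info axis (bools are excluded).
    col_is_int = isinstance(col, Integral) and not isinstance(col, bool)
    # Every other axis must be selected with a plain ':'.
    others_null = all(
        is_null_slice(idx) for i, idx in enumerate(indexer) if i != info_axis
    )
    return col_is_int and others_null and labels_unique


# Assumed indexer for df.loc[:, 'A']: scalar column position.
print(takes_single_column_path((slice(None), 0)))    # True
# Assumed indexer for df.loc[:, ['A']]: list of column positions.
print(takes_single_column_path((slice(None), [0])))  # False
```

Under those assumptions, only the scalar case takes the `self.obj[...] = value` shortcut; the list case falls through to the block-level `setitem`.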

@gfyoung gfyoung added Indexing Related to indexing on series/frames, not to indexes themselves Bug labels Dec 23, 2018
@simonjayhawkins
Member

In this one, (maybe a different bug), we fail to convert ['a', 'b'] to float when they start out in an object block with other values.

On master, the first case now fails to convert as well.

>>> import numpy as np
>>>
>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+1029.gbdf969cd6'
>>>
>>> df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, dtype=object)
>>>
>>> print(df.dtypes)
A    object
B    object
dtype: object
>>> # Out[21]:
>>> # A    object
>>> # B    object
>>> # dtype: object
>>>
>>> df.loc[:, ["A"]] = df.loc[:, ["A"]].astype(int)
>>>
>>> print(df.dtypes)
A    object
B    object
dtype: object
>>> # Out[23]:
>>> # A    int64
>>> # B    int64
>>> # dtype: object
>>>
>>> df = pd.DataFrame(
...     [[np.nan, np.nan, 1, pd.Timestamp("2000")]],
...     columns=["a", "b", "c", "d"],
...     dtype=object,
... )
>>>
>>> df.loc[:, ["a", "b"]] = df.loc[:, ["a", "b"]].astype(float)
>>>
>>> df.dtypes
a    object
b    object
c    object
d    object
dtype: object
>>> # Out[15]:
>>> # a    object
>>> # b    object
>>> # c    object
>>> # d    object
>>> # dtype: object
>>>

6 participants