BUG: Cannot convert existing column to categorical #52593

Zahlii · 2023-04-11T12:43:27Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

x = pd.DataFrame({
    "A": pd.Categorical(["A", "B"], categories=["A", "B"]),
    "B": [1, 2],
    "C": ["D", "E"]
})
print(x.dtypes)
# Assign the categorical version of "C" back through .loc
x.loc[:, "C"] = pd.Categorical(x.loc[:, "C"], categories=["D", "E"])
print(x.dtypes)

Issue Description

When setting an existing column to its categorical equivalent, the underlying dtypes stay the same.

Expected Behavior

Output now:

A    category
B       int64
C      object
dtype: object
A    category
B       int64
C      object
dtype: object

Expected output:

A    category
B       int64
C      object
dtype: object
A    category
B       int64
C    category  <----
dtype: object
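
For readers hitting the same behavior, here is a minimal sketch of two spellings that replace the column instead of writing into it. The frame mirrors the reproducible example; the names `via_setitem` and `via_astype` are only illustrative, and which option suits a given codebase is a judgment call rather than an official recommendation.

```python
import pandas as pd

x = pd.DataFrame({
    "A": pd.Categorical(["A", "B"], categories=["A", "B"]),
    "B": [1, 2],
    "C": ["D", "E"],
})

# Option 1: plain column assignment swaps in the new array, dtype included.
via_setitem = x.copy()
via_setitem["C"] = pd.Categorical(via_setitem["C"], categories=["D", "E"])

# Option 2: astype returns a new frame with only the listed columns converted.
via_astype = x.astype({"C": pd.CategoricalDtype(categories=["D", "E"])})

print(via_setitem.dtypes)
print(via_astype.dtypes)
```

Both results report `category` for column C, while the source frame `x` is left untouched.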

Installed Versions

INSTALLED VERSIONS

commit : 478d340
python : 3.9.14.final.0
python-bits : 64
OS : Darwin
OS-release : 22.4.0
Version : Darwin Kernel Version 22.4.0: Mon Mar 6 20:59:58 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 2.0.0
numpy : 1.23.2
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 67.4.0
pip : 23.0.1
Cython : 0.29.33
pytest : 7.2.2
hypothesis : None
sphinx : 6.1.3
blosc : None
feather : None
xlsxwriter : 3.0.8
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.5
jinja2 : 3.0.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2023.1.0
gcsfs : None
matplotlib : 3.7.0
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.1
snappy : None
sqlalchemy : 1.4.46
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2022.7
qtpy : None
pyqt5 : None


phofl · 2023-04-11T20:59:52Z

commit 4e4be0bfa8f74b9d453aa4163d95660c04ffea0c
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Wed Dec 21 11:57:24 2022 -0800

    DEPR: enforce inplaceness for df.loc[:, foo]=bar (#49775)
    
    * DEPR: enforce inplaceness for df.loc[:, foo]=bar

#49775

cc @jbrockmendel

Our checks aren't strict enough. This is also a problem when trying to set string columns, probably everything when setting into object.

jbrockmendel · 2023-04-12T14:32:26Z

Isn't this a case where the user should use df["C"] = ... instead of df.loc[:, "C"] = ...?

Zahlii · 2023-04-12T15:07:40Z

Isn't this a case where the user should use df["C"] = ... instead of df.loc[:, "C"] = ...?

Well, in our current codebase we have established that we want to be absolutely clear about which axis we are referring to when setting items, i.e. we explicitly use df.loc[X, :] = ... to set rows and df.loc[:, X] = ... to set columns. Otherwise, we had the issue that df["C"] in some cases inexplicably set the row instead of the column when we were dealing with pandas dataframes where index = columns (i.e. for transition matrices). In any case, it has worked this way since pandas 0.x, so I would assume that several other people may face the same issue; what's worse is that it just silently ignores the user's intention. The only reason we noticed this was our automated nightly tests.

jbrockmendel · 2023-04-20T21:19:21Z

Our checks aren't strict enough. This is also a problem when trying to set string columns, probably everything when setting into object.

@phofl can you expand on this? I'm not sure which checks you're referring to.

glemaitre · 2023-05-27T10:08:27Z

I stumbled into this issue as well with other types:

# %%
import pandas as pd

df = pd.DataFrame({
    'x': ["2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04", "2012-01-05"],
    'y': [2, 4, 6, 8, 10],
})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   x       5 non-null      object
 1   y       5 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 208.0+ bytes

# %%
df.loc[:, "x"] = pd.to_datetime(df.loc[:, "x"])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   x       5 non-null      object
 1   y       5 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 208.0+ bytes

where the column "x" has not been cast to datetime64[ns], as it was in pandas 1.5.

I would personally argue that using df["x"] = pd.to_datetime(df["x"]) is counter-intuitive, since it looks very close to the chained indexing that raises the "SettingWithCopyWarning":

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["x"][:] = 10

I, therefore, agree with the argument of @Zahlii regarding .loc being explicit (cf. #52593 (comment)).
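
A short sketch of conversions that do change the dtype, under the assumption that replacing the column (rather than writing into it) is acceptable. `assign` is shown as an alternative for those who would rather avoid item assignment altogether; the names `replaced` and `assigned` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "x": ["2012-01-01", "2012-01-02", "2012-01-03"],
    "y": [2, 4, 6],
})

# Replace the column: the frame adopts the dtype of the new values.
replaced = df.copy()
replaced["x"] = pd.to_datetime(replaced["x"])

# Same result without any item assignment: assign returns a new frame.
assigned = df.assign(x=lambda d: pd.to_datetime(d["x"]))

print(replaced.dtypes)
print(assigned.dtypes)
```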

jorisvandenbossche · 2023-05-31T08:01:22Z

Essentially, whenever your original column is object dtype, doing a df.loc[:, "col"] = ... will now never preserve the dtype of the values you are setting, because you can set anything in an object dtype array.

That seems like quite a serious usability regression, one that we might not have fully thought through when making this change.

I also think a lot of people (and teachers, tutorials, etc.) are recommending to use loc to be more explicit, so the df.loc[:, "col"] = ... pattern is very widespread.
If we decide to keep this change as is, we should at least give this a lot more visibility in our docs and release notes.
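
The point about object arrays can be illustrated at the NumPy level. This is only a sketch of the underlying mechanism (an object array accepts any Python object in place), not a trace of what pandas does internally.

```python
import numpy as np
import pandas as pd

# An object array stores arbitrary Python objects, so an in-place write
# never needs to change the array's dtype.
arr = np.array(["2012-01-01", "2012-01-02"], dtype=object)
arr[:] = [pd.Timestamp("2012-01-01"), pd.Timestamp("2012-01-02")]

print(arr.dtype)     # the dtype is still object
print(type(arr[0]))  # but the elements are now Timestamp objects
```

An in-place write into an object column can therefore always succeed, which is why the dtype of the assigned values is not carried over.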

jorisvandenbossche · 2023-05-31T08:13:08Z

And we should maybe consider reverting the behaviour (partly? just for object dtype?) so we can make another attempt at adding a more focused deprecation warning? (The cases that have been reported in the several issues should be relatively easy to detect.)

jbrockmendel · 2023-05-31T18:33:40Z

And we should maybe consider reverting the behaviour (partly? just for object dtype?) so we can make another attempt at adding a more focused deprecation warning?

In 1.5.3

df = pd.DataFrame({
    'x': ["2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04", "2012-01-05"],
    'y': [2, 4, 6, 8, 10],
})

>>> df.loc[:, "x"] = pd.to_datetime(df.loc[:, "x"])
<stdin>:1: DeprecationWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`

What warning would you issue instead?

I'd be against reverting as the old behavior involved a ton of inconsistencies. I could be OK with adding a warning specific to object-dtype-and-full-slice cases.

jorisvandenbossche · 2023-06-05T15:39:46Z

Sorry, I was confused with another deprecation we reverted in 1.5.x (slicing with ints). For the deprecation here, in the end we only changed it from FutureWarning to DeprecationWarning to make it less visible: #48673.
So yes, we certainly still did raise a warning in 1.5.x (although it seems we never fixed the message, as that is still giving the alternative with positional indices).

I could be OK with adding a warning specific to object-dtype-and-full-slice cases.

You mean adding a general UserWarning when you do df.loc[:, "col"] = ... for an object dtype column (and are setting with an object that has a specific, non-object dtype), on the assumption that you basically never want to do this (because now you always lose the dtype of the values you are setting, which is probably never going to be the intent of the user)?

jbrockmendel · 2023-06-05T17:56:16Z

You mean adding a general UserWarning when you do df.loc[:, "col"] = ... for an object dtype column (and are setting with an object that has a specific, non-object dtype), on the assumption that you basically never want to do this (because now you always lose the dtype of the values you are setting, which is probably never going to be the intent of the user)?

Yes.
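
To make the proposed check concrete, here is a user-side sketch of the condition being discussed: the target column is object dtype and the new values carry a specific non-object dtype. `set_column` is a hypothetical helper written for illustration; it is not a pandas API and does not reflect how such a warning would be implemented inside pandas.

```python
import warnings

import pandas as pd


def set_column(df, col, values):
    """Hypothetical helper: warn about the dtype loss, then replace the column."""
    new_dtype = getattr(values, "dtype", None)
    if df[col].dtype == object and new_dtype is not None and new_dtype != object:
        warnings.warn(
            f"column {col!r} is object dtype; an in-place .loc write would "
            f"discard the {new_dtype} dtype of the new values",
            UserWarning,
            stacklevel=2,
        )
    df[col] = values  # replaces the column, so the new dtype is kept


# dtype=object is forced so the sketch behaves the same across pandas versions.
df = pd.DataFrame({"x": pd.Series(["2012-01-01", "2012-01-02"], dtype=object)})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    set_column(df, "x", pd.to_datetime(df["x"]))

print(df.dtypes)
print([str(w.message) for w in caught])
```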

Zahlii · 2023-06-23T07:19:34Z

I would like to expand on this, as I find it very confusing even with the above notice.
I did some further testing, and for me it is VERY confusing that df[ABC] operates on columns as long as dtype(ABC) == dtype(columns), but on rows otherwise, i.e. with boolean indexing.

Consider the following extended scenario. Tasks that are common in preprocessing, such as converting batches of columns selected by a boolean mask, are no longer (easily) possible, because boolean masks can no longer be used to operate on columns; instead we have to use df[df.columns[mask]].

import pandas as pd
from pandas.core.dtypes.common import is_categorical_dtype

x = pd.DataFrame({
    "A": pd.Categorical(["A", "B", "B"], categories=["A", "B"]),
    "B": [1,2, 3],
    "C": ["D", "E", "E"]
}, index=["A", "B", "C"])
print(">Original")
print(x, "\n", x.dtypes)
print(">Set Categorical")
# doesn't work
x.loc[:, "C"] = pd.Categorical(x.loc[:, "C"], categories=["D", "E"])
print(x, "\n", x.dtypes)

print(">Set Categorical Direct")
# works
x["C"] = pd.Categorical(x.loc[:, "C"], categories=["D", "E"])
print(x, "\n", x.dtypes)

print(">Convert all Categories back to Str")
mask_cat = x.dtypes.map(is_categorical_dtype).values
# this won't work
x.loc[:, mask_cat] = x.loc[:, mask_cat].astype(str)
print(x, "\n", x.dtypes)

print(">Convert all Categories back to Str Direct")
# this won't work, either, because masks operate on ROWS
x[mask_cat] = x[mask_cat].astype(str)
print(x, "\n", x.dtypes)


print(">Convert all Categories back to Str via Columns")
# this works, but it is a pain in the ass
mask_cat_cols = x.columns[mask_cat]
x[mask_cat_cols] = x[mask_cat_cols].astype(str)
print(x, "\n", x.dtypes)
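
A possible shorter route for the batch conversion, sketched under the assumption that getting a new frame back is acceptable: collect the column labels and pass a mapping to `astype`. The dtype check uses `isinstance` against `pd.CategoricalDtype` instead of `is_categorical_dtype`, which I believe is deprecated in recent pandas releases; `cat_cols` and `converted` are illustrative names.

```python
import pandas as pd

x = pd.DataFrame({
    "A": pd.Categorical(["A", "B", "B"], categories=["A", "B"]),
    "B": [1, 2, 3],
    "C": pd.Categorical(["D", "E", "E"], categories=["D", "E"]),
}, index=["A", "B", "C"])

# Collect the labels of all categorical columns.
cat_cols = [
    col for col, dtype in x.dtypes.items()
    if isinstance(dtype, pd.CategoricalDtype)
]

# Let astype rebuild just those columns; the other columns are left alone.
converted = x.astype({col: str for col in cat_cols})
print(converted.dtypes)
```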

aalyousfi · 2023-07-23T07:51:58Z

I'm facing the same issue. A simplified example:

from datetime import datetime
from dateutil.relativedelta import relativedelta

df["month"] = pd.to_datetime(df["month"])
some_date = datetime.utcnow().date() - relativedelta(months=25)
df = df.loc[df.month.dt.date >= some_date]

I get an error:

Can only use .dt accessor with datetimelike values.

It's confusing. Is this behavior documented somewhere? It's a breaking change. Or is it a bug?

jbrockmendel · 2023-09-19T15:04:45Z

The OP example looks like it works on main. Can anyone else confirm?

lithomas1 · 2023-09-21T12:32:56Z

Still seems to fail for me

>>> x = pd.DataFrame({
...     "A": pd.Categorical(["A", "B"], categories=["A", "B"]),
...     "B": [1,2],
...     "C": ["D", "E"]
... })
>>> print(x.dtypes)
A    category
B       int64
C      object
dtype: object
>>> x.loc[:, "C"] = pd.Categorical(x.loc[:, "C"], categories=["D", "E"])
>>> print(x.dtypes)
A    category
B       int64
C      object
dtype: object

C still seems to be object.

tgy · 2023-09-21T16:44:01Z

I have the same issue

import pandas as pd

d = pd.DataFrame(dict(wanna_be_cat=['A', 'B', 'B']))

d.loc[:, 'wanna_be_cat'] = d['wanna_be_cat'].astype('category')

# 'wanna_be_cat' column not modified to become categorical
print(d['wanna_be_cat'].dtype)

d['wanna_be_cat'] = d['wanna_be_cat'].astype('category')

# 'wanna_be_cat' column modified to become categorical
print(d['wanna_be_cat'].dtype)

The fact that it works without .loc but doesn't with .loc is very confusing.

ManuelNavarroGarcia · 2023-11-22T08:57:16Z

I have a similar problem in this example with pandas == 2.1.2:

>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [[1, 2, 3], [4, 5], [6]]})
>>> cols = df.select_dtypes("integer").columns
>>> df.loc[:, cols].dtypes
a    int64
dtype: object
>>> df.loc[:, cols] = df.loc[:, cols].apply(pd.to_numeric, downcast="integer")
>>> df.loc[:, cols].dtypes
a    int64
dtype: object
>>> df[cols] = df[cols].apply(pd.to_numeric, downcast="integer")
>>> df.loc[:, cols].dtypes
a    int8
dtype: object

I would assume that both ways of assigning cols back into the original DataFrame should be equivalent, but they are not.

n-splv · 2024-02-12T13:35:58Z

Just wanted to convert dtype (float -> int) of multiple columns in my df and encountered this issue:

# This doesn't work
df.iloc[:, 2:] = df.iloc[:, 2:].astype(int)

# Neither does this
df.loc[:, df.columns[2:]] = df.loc[:, df.columns[2:]].astype(int)

# This works
for col in df.columns[2:]:
    df[col] = df[col].astype(int)

pandas 2.1.3
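
The loop can also be written as a single `astype` call with a column-to-dtype mapping. This is a sketch on a made-up frame (the column names are placeholders), and it assumes that rebinding `df` to the returned frame is fine.

```python
import pandas as pd

# Placeholder data: two leading columns to keep, two float columns to convert.
df = pd.DataFrame({
    "id": ["a", "b"],
    "name": ["x", "y"],
    "n1": [1.0, 2.0],
    "n2": [3.0, 4.0],
})

# Build a {column: dtype} mapping for the trailing columns and convert them
# in one call; astype returns a new frame, so rebind the name.
df = df.astype({col: int for col in df.columns[2:]})
print(df.dtypes)
```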

bcrotty · 2024-03-14T18:13:45Z

I have the same behavior as @n-splv on pandas 2.2.1, but if the correct solution is to not use .loc, then I don't think there should be a SettingWithCopyWarning.

df = pd.DataFrame({'number': [1.0, 2.0, 3.0]})
cp = df.loc[df["number"] % 2 == 1]

# This does not change the dtype
cp.loc[:, "number"] = cp["number"].astype(int)

# This changes the dtype but produces a SettingWithCopyWarning
cp["number"] = cp["number"].astype(int)

Or is there a better way to change the dtype?
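
One possible answer, offered as a sketch rather than official guidance: convert while building the subset, so that nothing is ever assigned into a slice of the original frame.

```python
import pandas as pd

df = pd.DataFrame({"number": [1.0, 2.0, 3.0]})

# astype returns a new, independent frame, so there is no assignment
# into a slice of df at any point.
cp = df.loc[df["number"] % 2 == 1].astype({"number": int})

print(cp.dtypes)
print(df.dtypes)  # the original frame is left as float64
```

Calling `.copy()` on the filtered frame before assigning with `cp["number"] = ...` should serve the same purpose when further in-place edits are needed.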

yuanx749 · 2024-04-22T06:07:51Z

Facing the same issue. For the time being, I used:

df = df.astype({col: "category" for col in categorical_cols})

Zahlii added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels Apr 11, 2023

phofl added the Indexing (related to indexing on series/frames, not to indexes themselves) label and removed the Needs Triage label Apr 11, 2023

phofl added this to the 2.0.1 milestone Apr 11, 2023

phofl added the Regression (functionality that used to work in a prior pandas version) label Apr 11, 2023

phofl mentioned this issue Apr 12, 2023: BUG: df.loc[:, col] type assignments doesn't propagate dtype changes #52612 (Closed)

phofl mentioned this issue Apr 17, 2023: BUG: dtype conversion fails on loc after pivot #52668 (Closed)

datapythonista modified the milestones: 2.0.1, 2.0.2 Apr 23, 2023

StefanieSenger mentioned this issue Apr 30, 2023: BUG: dtype changes when assigning on a slice or list of labels with .loc or iloc #52965 (Open)

rhshadrach mentioned this issue May 2, 2023: BUG: does not change column type when using .loc & to_datetime #53018 (Closed)

mj0nez mentioned this issue May 11, 2023: BUG: .dt accessor not available after loc-assignment #53172 (Closed)

datapythonista modified the milestones: 2.0.2, 2.0.3 May 26, 2023

lithomas1 modified the milestones: 2.0.3, 2.0.4 Jun 27, 2023

briochh mentioned this issue Aug 10, 2023: more pandas fixes pypest/pyemu#451 (Merged)

lithomas1 modified the milestones: 2.0.4, 2.1.1 Aug 30, 2023

lithomas1 modified the milestones: 2.1.1, 2.1.2 Sep 21, 2023

rhshadrach mentioned this issue Sep 23, 2023: ENH: Bring back the observed argument for groupby on Categorical columns #55237 (Closed)

mroeschke mentioned this issue Sep 26, 2023: BUG: Pandas 2.x iloc assignment does not reassign in some scenarios #55302 (Closed)

lithomas1 modified the milestones: 2.1.2, 2.1.3 Oct 26, 2023

lukemanley mentioned this issue Oct 29, 2023: BUG: #55717 (Closed)

jorisvandenbossche modified the milestones: 2.1.3, 2.1.4 Nov 13, 2023

lithomas1 modified the milestones: 2.1.4, 2.2 Dec 8, 2023

lithomas1 modified the milestones: 2.2, 2.2.1 Jan 20, 2024

lithomas1 modified the milestones: 2.2.1, 2.2.2 Feb 23, 2024

asishm mentioned this issue Mar 30, 2024: BUG: Can't change datetime precision in columns/rows #57838 (Open)

lithomas1 modified the milestones: 2.2.2, 2.2.3 Apr 11, 2024
