setting a whole column changes column dtype from "category" to "object" #22361

Open
teto opened this issue Aug 15, 2018 · 4 comments
Labels
Docs Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@teto

teto commented Aug 15, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd
from enum import Enum, auto

class ConnectionRoles(Enum):
    Client = auto()
    Server = auto()

# Ordered categorical dtype whose categories are the enum members
dtype_role = pd.api.types.CategoricalDtype(categories=list(ConnectionRoles), ordered=True)

# Empty frame with a single column, cast to the categorical dtype
df = pd.DataFrame(columns=["tcpdest"])
df = df.astype(dtype={"tcpdest": dtype_role})

print("before setting", df.dtypes.tcpdest)

# Assign a single enum member to the whole column
df['tcpdest'] = ConnectionRoles.Server

print("after setting", df.dtypes.tcpdest)

Problem description

Setting the whole Series changes the dtype from category to object, which breaks my code afterwards.

Expected Output

before setting category
after setting category

Output of pd.show_versions()


commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.24
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8

pandas: 0.23.3
pytest: None
pip: 18.0
setuptools: 40.0.0
Cython: 0.28.3
numpy: 1.14.5
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.6
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.4
xlrd: 0.9.4
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.3
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.10
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jorisvandenbossche
Member

This is the expected behaviour. When doing df['col'] = ... you overwrite the full column, not taking into account the existing one.
If you want to keep the dtype, you should only overwrite the values, e.g. with df['col'][:] = ... or df.loc[:, 'col'] = ....

For example:

In [12]: df['tcpdest'][:] = ConnectionRoles.Server

In [13]: df.dtypes
Out[13]: 
tcpdest    category
dtype: object
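
To make the "overwrite the full column" point concrete, here is a minimal sketch. It is not the reporter's exact case: it assumes a non-empty frame and uses plain string categories as a stand-in for the Enum members. The resulting dtype after assignment may be reported differently depending on the pandas version, so the sketch only checks that it is no longer categorical.

```python
import pandas as pd

# Stand-in data (an assumption for illustration): string categories
# instead of the ConnectionRoles enum from the report.
dtype_role = pd.CategoricalDtype(categories=["Client", "Server"], ordered=True)
df = pd.DataFrame(
    {"tcpdest": pd.Series(["Client", "Client", "Server"], dtype=dtype_role)}
)

dtype_before = df["tcpdest"].dtype

# df[col] = value builds a brand-new column from the right-hand side,
# so the dtype is inferred from the scalar and the old dtype is discarded.
df["tcpdest"] = "Server"
dtype_after = df["tcpdest"].dtype

print("before:", dtype_before)
print("after: ", dtype_after)
```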

@jorisvandenbossche
Member

Apparently, the df.loc[:, 'tcpdest'] way does not preserve the dtype (the df['tcpdest'][:] still does), which was not what I expected.

Anyhow, I am also not sure this is very well documented (in general: when values are coerced to the existing dtype, when dtypes are changed on updates, etc.).
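
One way to sidestep the question of which indexer preserves the dtype is to give the right-hand side the desired dtype before assigning. This is a sketch of that idea, not something proposed in the thread; the string categories are again a stand-in for the Enum. Note also that df['col'][:] = ... is chained assignment, so its behaviour may depend on the pandas version and on whether Copy-on-Write is enabled.

```python
import pandas as pd

# Stand-in data (an assumption for illustration).
dtype_role = pd.CategoricalDtype(categories=["Client", "Server"], ordered=True)
df = pd.DataFrame(
    {"tcpdest": pd.Series(["Client", "Client", "Server"], dtype=dtype_role)}
)

# Capture the dtype of the existing column, then build a replacement
# that already carries it. The whole column is still replaced, but the
# new column is categorical with the same categories and ordering.
existing_dtype = df["tcpdest"].dtype
df["tcpdest"] = pd.Categorical(["Server"] * len(df), dtype=existing_dtype)

print(df["tcpdest"].dtype)
```

The same pattern should also work on an empty frame, since `["Server"] * 0` is simply an empty list.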

teto added a commit to teto/pymptcpanalyzer that referenced this issue Aug 16, 2018
…to drop the syn/synack hence exporting options etc
@teto
Author

teto commented Aug 19, 2018

I had tried df.loc[:, 'tcpdest'] with the same result. Didn't know of df['tcpdest'][:], so thanks for the tip.

> This is the expected behaviour.

As a user, I am very surprised by this behavior. Is the rationale documented anywhere? I wish there were at least a warning, since I guess this would be unexpected for most users.

@jorisvandenbossche
Member

I understand that this can be surprising (even I, as an experienced pandas developer/user, can't always predict what the behaviour will be), but I think that, depending on the use case, many users would find the opposite behaviour surprising as well.
For example, say you have a dataframe with a column of datetime strings, and you convert them to actual datetime dtype with:

df['dates'] = pd.to_datetime(df['dates'])

Here, you don't want the 'dates' column to preserve its string dtype. Not overwriting the full column (and its dtype) would be surprising, I think (and in any case, changing this would break a lot of existing code).
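
A small runnable version of that datetime case, with made-up sample dates, shows why replacing the dtype on whole-column assignment is often what users want:

```python
import pandas as pd

# Made-up sample data: a column of date strings.
df = pd.DataFrame({"dates": ["2018-08-15", "2018-08-16", "2018-08-19"]})
was_datetime_before = pd.api.types.is_datetime64_any_dtype(df["dates"])

# Whole-column assignment replaces both the values and the dtype,
# so the string column becomes a datetime column.
df["dates"] = pd.to_datetime(df["dates"])
is_datetime_after = pd.api.types.is_datetime64_any_dtype(df["dates"])

print(was_datetime_before, is_datetime_after)
```

If assignment preserved the old dtype instead, the parsed timestamps would have to be coerced back to strings, which is the behaviour described above as surprising.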


That said, I find the df.loc[:, 'tcpdest'] behaviour somewhat surprising (or at least, it means that it is not that easy to select all values of the column).

Also, I suppose this can certainly be better documented. I don't know offhand where it is documented now.

@mroeschke mroeschke added the Indexing Related to indexing on series/frames, not to indexes themselves label Jun 22, 2021