BUG: set_index on more than 1 column changes boolean values #46987

Open
3 tasks done
alonme opened this issue May 10, 2022 · 9 comments
Labels
Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Bug, hashing (hash_pandas_object), MultiIndex, Needs Discussion (Requires discussion from core team before further action)

Comments

@alonme
Contributor

alonme commented May 10, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# create df with booleans and 0,1
df = pd.DataFrame({'group_a': [1,2,1,2], 'group_b': [0,1,True,False], 'value':range(4)})

print(df)

# True and False were changed
print(df.set_index(['group_a', 'group_b']))

# Problem doesn't happen if 0/1 are not present in the column
df = pd.DataFrame({'group_a': [1,2,1,2], 'group_b': [2,3,True,False], 'value':range(4)})

print(df)

print(df.set_index(['group_a', 'group_b']))

Issue Description

set_index on more than one column may change boolean values to integers/floats.

This only happens if 0, 0.0, 1, or 1.0 are found in the same column as True/False.

Expected Behavior

set_index should not change values

Installed Versions

INSTALLED VERSIONS

commit : 4bfe3d0
python : 3.8.7.final.0
python-bits : 64
OS : Darwin
OS-release : 20.4.0
Version : Darwin Kernel Version 20.4.0: Fri Mar 5 01:14:14 PST 2021; root:xnu-7195.101.1~3/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.4.2
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.3.0
Cython : 0.29.28
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : 1.10.6
feather : None
xlsxwriter : None
lxml.etree : 4.8.0
html5lib : None
pymysql : None
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : 8.3.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.3.0
gcsfs : None
markupsafe : 2.1.1
matplotlib : 3.5.2
numba : 0.53.1
numexpr : 2.8.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 6.0.1
pyreadstat : None
pyxlsb : None
s3fs : 2022.02.0
scipy : 1.8.0
snappy : None
sqlalchemy : None
tables : 3.7.0
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
zstandard : None

@alonme alonme added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 10, 2022
@simonjayhawkins simonjayhawkins added MultiIndex hashing hash_pandas_object Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels May 12, 2022
@alonme
Contributor Author

alonme commented Sep 30, 2022

The issue arises from the implementation of PyObjectHashTable, which in turn uses the Python built-in hash.

In CPython, False and True have the same hash values as 0 and 1:
[screenshot not preserved]
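
The hash collision described here can be checked directly in an interpreter. The following is a minimal sketch in plain Python (no pandas involved); the variable names `pairs` and `results` are only illustrative.

```python
# In CPython, bool is a subclass of int, so True/False compare equal to 1/0
# and hash to the same value. The same holds for the floats 1.0 and 0.0.
pairs = [(True, 1), (False, 0), (True, 1.0), (False, 0.0)]

# For each pair, record (values are equal, hashes are equal).
results = [(left == right, hash(left) == hash(right)) for left, right in pairs]

for (left, right), (equal, same_hash) in zip(pairs, results):
    print(f"{left!r} vs {right!r}: equal={equal}, same_hash={same_hash}")

print(issubclass(bool, int))
```

Because both equality and the hash agree, any hash table keyed on these objects has no way to tell them apart.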

This in turn affects sets and dictionaries:

[screenshot not preserved]
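
The effect on dictionaries can be reproduced with a small first-seen-wins deduplication. This is only a sketch in plain Python: `first_seen_unique` is a hypothetical helper, not a pandas function. It shows that whichever object arrives first (integer or boolean) is the one that survives.

```python
def first_seen_unique(values):
    """Deduplicate with a dict, keeping the first object seen for each key."""
    seen = {}
    for v in values:
        # setdefault leaves the stored object untouched if an equal key exists
        seen.setdefault(v, v)
    return list(seen.values())


ints_first = first_seen_unique([0, 1, True, False])
bools_first = first_seen_unique([False, True, 0, 1])

print(ints_first)   # [0, 1]
print(bools_first)  # [False, True]
```

The first call matches the original report (booleans turn into integers), and the second matches the direction where integers turn into booleans.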

This raises the question of whether this should be handled at all.
Personally, I do not like this behavior.

The following code example gives more context on my original use case, which I believe shows that this behavior is not ideal.

[screenshot not preserved]

@simonjayhawkins - what do you think?
Should I look into implementing a fix?

I assume we could add special handling for booleans in PyObjectHashTable (the same happens with numpy.bool_).
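
As a rough illustration of what such special handling might look like, here is a pure-Python sketch. It is not the actual PyObjectHashTable code: `factorize_sketch` and its `bool_aware` flag are hypothetical, and a dict stands in for the hash table. It only checks the built-in `bool`, so `numpy.bool_` would need separate treatment.

```python
def factorize_sketch(values, bool_aware=False):
    """Return (codes, uniques) using a dict as the hash table.

    With bool_aware=True the key also records whether the value is a bool,
    so True and 1 (or False and 0) land in different slots.
    """
    table = {}
    uniques = []
    codes = []
    for v in values:
        key = (isinstance(v, bool), v) if bool_aware else v
        if key not in table:
            table[key] = len(uniques)
            uniques.append(v)
        codes.append(table[key])
    return codes, uniques


values = [0, 1, True, False]

# Plain hashing: booleans collapse onto the integers seen first.
codes, uniques = factorize_sketch(values)
plain = [uniques[c] for c in codes]

# Bool-aware hashing: the round trip preserves the original objects.
codes_b, uniques_b = factorize_sketch(values, bool_aware=True)
aware = [uniques_b[c] for c in codes_b]

print(plain)  # [0, 1, 1, 0]
print(aware)  # [0, 1, True, False]
```

Note that this key still treats 1 and 1.0 as the same value; only booleans are separated.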

@alonme
Contributor Author

alonme commented Oct 1, 2022

The issue is more general than only "set_index on more than 1 column"

[screenshot not preserved]

@rhshadrach
Member

A pandas index should be thought of as a hash map (e.g. a dictionary). With this mental model, it would be unexpected if two objects were equal but not treated as the same key in a hash map. I agree with the stance that this is undesired behavior, but I think it is undesired behavior of Python rather than pandas. I am concerned that going against how Python behaves will only lead to further issues or confusion.

I am also surprised to find that NumPy also exhibits this behavior:

print(np.bool_(True) == np.int_(1))
# True

@alonme
Contributor Author

alonme commented Nov 23, 2022

@rhshadrach - thanks!
What should be our next step for this issue?

@rhshadrach
Member

rhshadrach commented Nov 23, 2022

I think this needs more discussion; my current thought is that while this behavior is undesirable, modifying it may be more undesirable.

Another example. Trying to index using a Boolean value is specifically handled by pandas, but float is not.

df = pd.DataFrame({'a': [1, 0, 2], 'b': [3, 4, 5]}).set_index('a')

print(df)
#    b
# a   
# 1  3
# 0  4
# 2  5

print(df.loc[True])
# KeyError: 'True: boolean label can not be used without a boolean index'

print(df.loc[1.0])
# b    3
# Name: 1, dtype: int64
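
For comparison, a plain Python dict makes no such distinction. The sketch below uses a hypothetical `lookup` dict with the same keys as the index above; both the float and the boolean resolve to the integer key.

```python
lookup = {1: "row 1", 0: "row 0", 2: "row 2"}

# 1.0 and True are equal to 1 and hash the same, so both find the key 1.
from_float = lookup[1.0]
from_bool = lookup[True]

print(from_float)  # row 1
print(from_bool)   # row 1
```

So the KeyError for `df.loc[True]` is a check pandas adds on top of the hash lookup, while the float case follows the dict behavior.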

@alonme
Contributor Author

alonme commented Nov 23, 2022

OK.

I would gladly help find more places where this behavior occurs, or implement a fix.

@buhtz

buhtz commented Feb 1, 2023

Before opening another issue, I would like to know whether the following behaviour also stems from the same problem (PyObjectHashTable) described in this issue.

Creating a Pseudo-MultiIndex with tuples in an Index is a solution I was pointed to here: https://stackoverflow.com/a/75309326/4865723

But an Index with tuples in it instead of a real MultiIndex does create some other problems for me.

The problem shown in this MWE is that, after converting the Index into a MultiIndex, the integers 0 and 1 in the second level of that index are converted into booleans.

#!/usr/bin/env python3
import pandas

df = pandas.DataFrame({
    'idx1': list('AABB'),
    'idx2': [False, True, 0, 1],
    'vals': range(4)
})

df = df.set_index(pandas.Index(df[['idx1', 'idx2']])).drop(columns=['idx1', 'idx2'])

print(df.to_markdown())
# |              |   vals |
# |:-------------|-------:|
# | ('A', False) |      0 |
# | ('A', True)  |      1 |
# | ('B', 0)     |      2 |
# | ('B', 1)     |      3 |

print(df.index)
# -> Index([('A', False), ('A', True), ('B', 0), ('B', 1)], dtype='object')

multi_idx = pandas.Index(df.index.to_list())
print(multi_idx)
# MultiIndex([('A', False),
#             ('A',  True),
#             ('B', False),
#             ('B',  True)],
#            )

@alonme
Contributor Author

alonme commented Jul 20, 2023

Hey @rhshadrach,
Any news regarding this issue? It keeps causing trouble for me.
I understand the complexity around this, but I hope we can figure out a solution; maybe we can disable that using some flag / configuration / context ?

@rhshadrach
Member

maybe we can disable that using some flag / configuration / context ?

I think I'm -1 here. You're asking pandas to differentiate two Python objects when Python itself does not.

print({True: 1, 1: 2})
# {True: 2}

It seems to me that this is going to lead to special casing and complexities in the pandas code that provide little value.

@rhshadrach rhshadrach added Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 2, 2023