Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: DataFrame.update doesn't preserve dtypes #55509

Closed
3 tasks done
aaron-robeson-8451 opened this issue Oct 13, 2023 · 8 comments · Fixed by #57637
Closed
3 tasks done

BUG: DataFrame.update doesn't preserve dtypes #55509

aaron-robeson-8451 opened this issue Oct 13, 2023 · 8 comments · Fixed by #57637
Assignees
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions
Milestone

Comments

@aaron-robeson-8451
Copy link
Contributor

aaron-robeson-8451 commented Oct 13, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df1 = pd.DataFrame({
    'idx': [1, 2],
    'val': [True]*2,
}).set_index('idx')

df2 = pd.DataFrame({
    'idx': [1],
    'val': [True],
}).set_index('idx')

print(df1.dtypes['val'])
print(df2.dtypes['val'])

df1.update(df2)

print(df1.dtypes['val'])

Issue Description

Related to issue #4094. When calling df1.update(df2) the datatype of column val is converted to object, despite the fact that it is type bool on both df1 and df2. This occurs when the index of df1 contains values not present in the index of df2.

Expected Behavior

The expected output of the example is:

bool
bool
bool

In practice the output is:

bool
bool
object

Installed Versions

INSTALLED VERSIONS

commit : e86ed37
python : 3.10.13.final.0
python-bits : 64
OS : Darwin
OS-release : 22.6.0
Version : Darwin Kernel Version 22.6.0: Wed Jul 5 22:21:56 PDT 2023; root:xnu-8796.141.3~6/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 2.1.1
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.1.2
pip : 23.2.1
Cython : None
pytest : 7.4.2
hypothesis : None
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : 3.1.6
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.5.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.9.2
gcsfs : None
matplotlib : 3.7.3
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@aaron-robeson-8451 aaron-robeson-8451 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 13, 2023
@aaron-robeson-8451 aaron-robeson-8451 changed the title BUG: Update should preserve dtypes BUG: Update doesn't preserve dtypes Oct 13, 2023
@rhshadrach
Copy link
Member

Thanks for the report! Confirmed on main.

In the current implementation of DataFrame.update, we are reindexing the passed other, coercing its dtype. I wonder if we can instead modify the caller inplace (via loc) with an appropriate mask. Further investigations and PRs to fix are welcome!

@rhshadrach rhshadrach added Dtype Conversions Unexpected or buggy dtype conversions and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 13, 2023
MichaelTiemannOSC added a commit to MichaelTiemannOSC/pandas that referenced this issue Oct 17, 2023
Preserve dtype when updating from dataframe whose NA values don't affect original.

I don't know the best place to put the test case in the tests/frame or the tests/dtype directory.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
@simonjayhawkins simonjayhawkins changed the title BUG: Update doesn't preserve dtypes BUG: DataFrame.update doesn't preserve dtypes Oct 17, 2023
@MichaelTiemannOSC
Copy link
Contributor

This is much trickier than I thought. In the bool case the udpate result comes back as object but the final values are all bool. But in the int case the updated results come back as float64 because the nan converts the reindexed array to float in order to hold the nan. In the string case the val column df1 starts with type object.

Here's the test case I'm constructing (in pandas/tests/frame/methods/test_update.py:

    @pytest.mark.parametrize(
        "value ,dtype",
        [
            (True, bool),
            (1, int),
            (np.uint64(2), np.dtype("uint64")),
            (3.0, float),
            (4.0 + 1j, complex),
            ("a", pd.StringDtype()),
            (pd.to_timedelta("1 ms"), np.dtype("timedelta64[ns]")),
            (np.datetime64("2000-01-01T00:00:00"), np.dtype("datetime64[ns]")),
        ],
    )
    def test_update_preserve_dtype(self, value, dtype):
	# GH#55509                                                                                                                                                                                                                  
        from pandas.core.dtypes.common import is_string_dtype

        df1 = pd.DataFrame(
            {
                "idx": [1, 2],
                "val": [value] * 2,
            }
        ).set_index("idx")

        df2 = pd.DataFrame(
            {
                "idx": [1],
                "val": [value],
            }
        ).set_index("idx")

        if is_string_dtype(dtype):
            # Should we test for string, given its default behavior?                                                                                                                                                                
            df1 = df1.astype("string")
            df2 = df2.astype("string")

        assert df1.dtypes["val"] == dtype
        assert df2.dtypes["val"] == dtype

        df1.update(df2)

        assert df1.dtypes["val"] == dtype

Amy I trying too hard?

@aureliobarbosa
Copy link
Contributor

Hey @MichaelTiemannOSC,

I have been studying the problem for a few days now and got a solution following the suggestions made above by @rhshadrach. As it changed the algorithm a little bit, I implemented a prototype and I still need to move it to the codebase. But it is already passing my own tests and the ones you suggested above (except for strings, which I think is more delicate as it involves checking a type and casting).

Would you mind if I take this issue and made another PR? I am open to discuss. Regards

@MichaelTiemannOSC
Copy link
Contributor

Feel free. We all work together 😊

@aureliobarbosa
Copy link
Contributor

take

@aureliobarbosa
Copy link
Contributor

It seems that I am ready to make a PR, but I have a question about the tests.

The proposed code is passing all tests regarding the DataFrame.update() method, and it also passed a longer test similar to the one proposed above by @MichaelTiemannOSC. But I am not sure whether I should include this longer test case or should I just create an smaller one testing only for the issue proposed by @aaron-robeson-8451?

@rhshadrach

@rhshadrach
Copy link
Member

More thorough testing, as long as it is not duplicative, is preferred.

aureliobarbosa added a commit to aureliobarbosa/pandas that referenced this issue Oct 20, 2023
aureliobarbosa pushed a commit to aureliobarbosa/pandas that referenced this issue Oct 22, 2023
…ndas-dev#55589)

* Update generic.py

* Update generic.py

* Update generic.py

Fix for issue pandas-dev#55509 (updating intersection only)

fix typo - black

Added test
@aureliobarbosa
Copy link
Contributor

@MichaelTiemannOSC

I confirm that reindexing is changing the dtype as mentioned and that this introduces problems when using reindex inside the update method. If update is rewritten without using reindex, those problems could be avoided (once the new code passes pull all tests, which is not the case yet).

The problem with reindex changing the dtype of the data is possibly related to issue #47503, and may deserve further studies.
Here follows a minimal example code which captures this in the current context (which seems simpler than that shown in #47503):

df = pd.DataFrame({"a": [2, 2]}, index=[0, 1], dtype=np.int64)
other = pd.DataFrame({"a": [1, 1]}, index=[1, 2], dtype=np.int64)
other_reindex = other.reindex(df.index)

print(other["a"].dtype, other_reindex["a"].dtype)

aureliobarbosa added a commit to aureliobarbosa/pandas that referenced this issue Nov 6, 2023
aureliobarbosa added a commit to aureliobarbosa/pandas that referenced this issue Nov 6, 2023
aureliobarbosa added a commit to aureliobarbosa/pandas that referenced this issue Nov 6, 2023
aureliobarbosa added a commit to aureliobarbosa/pandas that referenced this issue Nov 6, 2023
aureliobarbosa added a commit to aureliobarbosa/pandas that referenced this issue Nov 9, 2023
aureliobarbosa added a commit to aureliobarbosa/pandas that referenced this issue Nov 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment