Possible bug in df.update #29462

Open
isaacgg opened this issue Nov 7, 2019 · 2 comments
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays. Timeseries

Comments

@isaacgg

isaacgg commented Nov 7, 2019

Code Sample, a copy-pastable example if possible

from datetime import datetime

import pandas as pd

test_json = [{"_id": "a", "date": datetime.now()}, {"_id": "b", "date": datetime.now()}]
test_df = pd.DataFrame(test_json)

new_df = test_df.copy()
new_df["date"] = None
new_df.update(test_df)

print(test_df.head())
print(new_df.head())

Problem description

When DataFrame.update is used with datetime data, the values are automatically converted to integer timestamps, which seems like abnormal behaviour to me. The code above outputs:

  _id                        date
0   a  2019-11-07 15:50:06.072158
1   b  2019-11-07 15:50:06.072158
  _id                 date
0   a  1573141806072158000
1   b  1573141806072158000

Expected Output

  _id                        date
0   a  2019-11-07 15:50:06.072158
1   b  2019-11-07 15:50:06.072158
  _id                        date
0   a  2019-11-07 15:50:06.072158
1   b  2019-11-07 15:50:06.072158

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.1
numpy : 1.16.4
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.2
setuptools : 41.0.1
Cython : 0.29.13
pytest : None
hypothesis : None
sphinx : 2.1.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.7.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@susan-shu-c

susan-shu-c commented Nov 13, 2019

Hi, I was able to reproduce this result. It happens because pandas.DataFrame.update calls expressions.where (source link).

From there it eventually calls numpy.where (documentation), which in turn eventually uses the NumPy MaskedArray type (source link).

Using numpy.where seems to be a deliberate choice: it converts the datetime values to Unix time, which speeds up the computation. Feel free to correct me on that, though.

I'd suggest trying pandas.to_datetime (linked here) to convert the values back afterward (sometimes you have to reduce the precision of the Unix time, by removing digits from the end, to get it to work), but I haven't tested this on your example data yet, so feel free to try it.
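
A minimal sketch of that conversion, assuming the affected column holds integer nanoseconds since the Unix epoch, as in the output reported above. The series is built by hand here (the name `raw` is illustrative) so the example does not depend on which pandas version reproduces the bug:

```python
import pandas as pd

# Integer nanoseconds since the Unix epoch, copied from the issue's output.
# dtype=object mimics the column left behind after the update call.
raw = pd.Series([1573141806072158000, 1573141806072158000], dtype=object)

# Cast to int64 first so the values are plain integers, then state the
# unit explicitly instead of relying on the default.
restored = pd.to_datetime(raw.astype("int64"), unit="ns")

print(restored)
```

With `unit="ns"` stated explicitly, no digits need to be trimmed from the integers; trimming would only be needed if a coarser unit such as `"s"` or `"ms"` were used.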

@simonjayhawkins simonjayhawkins added Needs Triage Issue that has not been reviewed by a pandas team member Timeseries labels Apr 24, 2020
@rhshadrach
Member

Indeed, this appears to be an odd interaction with DatetimeArray and np.where.

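import numpy as np
import pandas as pd
from datetime import datetime

a = np.asarray([None], dtype=object)  # object-dtype array holding None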
b = np.asarray(pd.arrays.DatetimeArray(pd.Series([datetime.now()])))
cond = [False]
print(np.where(cond, a, b))

gives [1595782471507905000]; whereas

a = np.asarray([None], dtype=object)
b = np.asarray([datetime.now()], dtype=object)
cond = [False]
print(np.where(cond, a, b))

gives [datetime.datetime(2020, 7, 26, 16, 53, 4, 806281)]
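
The difference between the two snippets may come down to how NumPy casts datetime64 values to object dtype, which is what np.where has to do when one input is an object array. A small sketch of that cast in isolation, under the assumption that this is the conversion at play (the variable names are illustrative): nanosecond-resolution values do not fit in a datetime.datetime, so NumPy falls back to plain integers, while microsecond-resolution values convert cleanly.

```python
from datetime import datetime

import numpy as np

stamp = "2020-07-26T16:53:04.806281"

# Nanosecond resolution (what pandas uses by default): casting to object
# yields Python ints, because datetime.datetime cannot hold nanoseconds.
ns_as_obj = np.array([stamp], dtype="datetime64[ns]").astype(object)

# Microsecond resolution: casting to object yields datetime.datetime.
us_as_obj = np.array([stamp], dtype="datetime64[us]").astype(object)

print(type(ns_as_obj[0]), ns_as_obj[0])
print(type(us_as_obj[0]), us_as_obj[0])
```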

@rhshadrach rhshadrach added the ExtensionArray Extending pandas with custom dtypes or arrays. label Jul 26, 2020
@TomAugspurger TomAugspurger removed the Needs Triage Issue that has not been reviewed by a pandas team member label Sep 4, 2020
@mroeschke mroeschke added the Bug label Jul 23, 2021