
BUG: scalar assignment of a tz-aware is object dtype #19843

Closed
jreback opened this issue Feb 22, 2018 · 5 comments

@jreback (Contributor) commented Feb 22, 2018

In [3] below should result in a datetime64[ns, UTC] column:

In [1]: df = pd.DataFrame({'A': [0, 1]})

In [3]: df['now'] = pd.Timestamp('20130101', tz='UTC')

In [4]: df
Out[4]: 
   A                        now
0  0  2013-01-01 00:00:00+00:00
1  1  2013-01-01 00:00:00+00:00

In [5]: df.dtypes
Out[5]: 
A       int64
now    object
dtype: object

In [6]: df['now2'] = pd.DatetimeIndex([pd.Timestamp('20130101', tz='UTC')]).repeat(len(df))

In [7]: df.dtypes
Out[7]: 
A                     int64
now                  object
now2    datetime64[ns, UTC]
dtype: object
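At the time of this report, a workaround that preserves the tz-aware dtype is to broadcast the scalar through a Series rather than assigning it directly (a sketch, using only public API):

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1]})
# Broadcasting the scalar through a Series keeps the pandas extension
# dtype, sidestepping the object-dtype fallback shown above
df['now'] = pd.Series(pd.Timestamp('20130101', tz='UTC'), index=df.index)
print(df.dtypes['now'])  # datetime64[ns, UTC]
```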
@DylanDmitri (Contributor) commented Feb 22, 2018

I will try and fix this.

@jreback (Contributor, Author) commented Feb 23, 2018

great!

@DylanDmitri (Contributor) commented Feb 23, 2018

Currently, infer_dtype_from_scalar (on datetime/timestamp-like objects) returns np.datetime64 if no timezone is given, and falls back to np.object_ for objects with timezones. Fixing this problem means returning something other than np.object_.

Ideally it would return DatetimeTZDtypeType. However, that crashes at np.empty(shape, dtype=dtype) in cast_scalar_to_array, because numpy cannot allocate an array with a pandas extension dtype. It seems like this should work, but it doesn't.

A quick fix is to return np.datetime64 rather than np.object_. You lose the timezone name, but numpy applies the correct UTC offset before storing, so the underlying values are correct. This change doesn't break any tests, and results in the following behavior:

In [1]: df = pd.DataFrame({'A': [0, 1]})

In [3]: df['now'] = pd.Timestamp('20130101', tz='UTC')

In [5]: df.dtypes
Out[5]: 
A               int64
now    datetime64[ns]
dtype: object

In [6]: df['now2'] = pd.DatetimeIndex([pd.Timestamp('20130101', tz='UTC')]).repeat(len(df))

In [7]: df.dtypes
Out[7]: 
A                     int64
now          datetime64[ns]
now2    datetime64[ns, UTC]
dtype: object

This raises some inconsistencies, and potentially causes problems when mixing in timezone-naive datetimes. Is the quick fix good enough?
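The timezone-name loss the quick fix implies can be seen directly: converting a tz-aware Timestamp to a numpy datetime64 normalizes the value to UTC and drops the zone (a small demonstration, public API only):

```python
import numpy as np
import pandas as pd

ts = pd.Timestamp('2013-01-01 00:00', tz='US/Eastern')
# numpy stores the UTC-normalized instant; the zone name is gone,
# which is why the column above comes back as plain datetime64[ns]
print(ts.to_datetime64())  # 2013-01-01T05:00:00.000000000
```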

@jreback (Contributor, Author) commented Feb 23, 2018

@DylanDmitri you never want numpy to deal with timezones; it gets them completely wrong. infer_dtype_from_scalar has a pandas_dtype parameter that will make this work. We should actually just change this to be the default (though that might break other things).
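The pandas-native tz-aware dtype that infer_dtype_from_scalar should produce here is an extension dtype, not a numpy dtype; this is visible through the public dtype machinery (a sketch, avoiding the internal function itself since its signature has changed across versions):

```python
from pandas.api.types import pandas_dtype

# 'datetime64[ns, UTC]' resolves to a pandas extension dtype,
# which numpy's own dtype system cannot represent
dt = pandas_dtype('datetime64[ns, UTC]')
print(type(dt).__name__)  # DatetimeTZDtype
print(dt.tz)              # UTC
```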

@DylanDmitri (Contributor) commented Mar 2, 2018

Been busy the last week, sorry. Here's the problem code (from line 2874 of frame.py):

# BEFORE
value = cast_scalar_to_array(len(self.index), value)
value = maybe_cast_to_datetime(value, value.dtype)

The main issue: cast_scalar_to_array defaults to dtype np.object_, which is then ignored by maybe_cast_to_datetime. We want to capture the real pandas dtype up front and pass it into maybe_cast_to_datetime, which then works properly.

# AFTER
from pandas.core.dtypes.cast import infer_dtype_from_scalar
pandas_dtype, _ = infer_dtype_from_scalar(value, pandas_dtype=True)

value = cast_scalar_to_array(len(self.index), value)
value = maybe_cast_to_datetime(value, pandas_dtype)

This fixes the problem. I'll check the tests and have a PR up soon.
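The end-to-end behavior the patch targets can be checked with public API alone; on a pandas build that includes the fix, scalar assignment preserves the tz-aware dtype:

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1]})
df['now'] = pd.Timestamp('20130101', tz='UTC')
# With the fix, the column is tz-aware instead of object dtype
print(df.dtypes['now'])  # datetime64[ns, UTC]
```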

DylanDmitri added a commit to DylanDmitri/pandas that referenced this issue Mar 2, 2018

@DylanDmitri DylanDmitri referenced this issue Mar 2, 2018

Merged

fix: scalar timestamp assignment (#19843) #19973


@jreback jreback modified the milestones: 0.23.0, Next Major Release Apr 14, 2018

jreback added a commit that referenced this issue Aug 2, 2018

minggli added a commit to minggli/pandas that referenced this issue Aug 5, 2018

merge master
* master: (47 commits)
  Run tests in conda build [ci skip] (pandas-dev#22190)
  TST: Check DatetimeIndex.drop on DST boundary (pandas-dev#22165)
  CI: Fix Travis failures due to lint.sh on pandas/core/strings.py (pandas-dev#22184)
  Documentation: typo fixes in MultiIndex / Advanced Indexing (pandas-dev#22179)
  DOC: added .join to 'see also' in Series.str.cat (pandas-dev#22175)
  DOC: updated Series.str.contains see also section (pandas-dev#22176)
  0.23.4 whatsnew (pandas-dev#22177)
  fix: scalar timestamp assignment (pandas-dev#19843) (pandas-dev#19973)
  BUG: Fix get dummies unicode error (pandas-dev#22131)
  Fixed py36-only syntax [ci skip] (pandas-dev#22167)
  DEPR: pd.read_table (pandas-dev#21954)
  DEPR: Removing previously deprecated datetools module (pandas-dev#6581) (pandas-dev#19119)
  BUG: Matplotlib scatter datetime (pandas-dev#22039)
  CLN: Use public method to capture UTC offsets (pandas-dev#22164)
  implement tslibs/src to make tslibs self-contained (pandas-dev#22152)
  Fix categorical from codes nan 21767 (pandas-dev#21775)
  BUG: Better handling of invalid na_option argument for groupby.rank(pandas-dev#22124) (pandas-dev#22125)
  use memoryviews instead of ndarrays (pandas-dev#22147)
  Remove depr. warning in SeriesGroupBy.count (pandas-dev#22155)
  API: Default to_* methods to compression='infer' (pandas-dev#22011)
  ...