Where method does not properly handle values with datetimes with TZ #21546

Closed
Liam3851 opened this issue Jun 19, 2018 · 7 comments · Fixed by #21660

@Liam3851
Contributor

Code Sample, a copy-pastable example if possible

Series input:

    import pandas as pd
    dts1 = pd.date_range('20150101', '20150105', tz='America/New_York')
    df1 = pd.DataFrame({'date': dts1})
    dts2 = pd.date_range('20150103', '20150107', tz='America/New_York')
    df2 = pd.DataFrame({'date': dts2})
    ser_result = df1.date.where(df1.date < df1.date[3], df2.date)
    ser_result

Series output:

0    2015-01-01 00:00:00-05:00
1    2015-01-02 00:00:00-05:00
2    2015-01-03 00:00:00-05:00
3          1420520400000000000
4          1420606800000000000
Name: date, dtype: object
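
The two integers in rows 3 and 4 are consistent with nanoseconds since the Unix epoch in UTC. As a quick check (a sketch using only public pandas calls, not code from this thread), decoding them yields the timestamps that the expected output lists for those rows:

```python
import pandas as pd

# Integers copied from rows 3 and 4 of the Series output above.
raw = [1420520400000000000, 1420606800000000000]

# Interpret each integer as nanoseconds since the epoch (UTC),
# then express it in the time zone used by the example.
decoded = [pd.Timestamp(v, tz="America/New_York") for v in raw]
print(decoded)
```

So the data itself is not lost; the tz-aware values appear to have been reduced to their integer representation without being converted back.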

DataFrame input:

    import numpy as np
    import pandas as pd
    dts1 = pd.date_range('20150101', '20150105', tz='America/New_York')
    df1 = pd.DataFrame({'date': dts1, 'x': np.arange(5), 'y': dts1.tz_localize(None)})
    dts2 = pd.date_range('20150103', '20150107', tz='America/New_York')
    df2 = pd.DataFrame({'date': dts2, 'x': np.arange(3, 8), 'y': dts2.tz_localize(None)})
    mask = pd.DataFrame(True, index=df1.index, columns=df2.columns)
    mask.iloc[3:] = False
    df_result = df1.where(mask, df2)
    df_result

DataFrame output:

                       date  x          y
0 2015-01-01 00:00:00-05:00  0 2015-01-01
1 2015-01-02 00:00:00-05:00  1 2015-01-02
2 2015-01-03 00:00:00-05:00  2 2015-01-03
3 2015-01-04 00:00:00-05:00  6 2015-01-06
4 2015-01-05 00:00:00-05:00  7 2015-01-07
  2015-01-01 00:00:00-05:00
  2015-01-02 00:00:00-05:00
  2015-01-03 00:00:00-05:00
  2015-01-04 00:00:00-05:00
  2015-01-05 00:00:00-05:00
  2015-01-01 00:00:00-05:00
  2015-01-02 00:00:00-05:00
  2015-01-03 00:00:00-05:00
  2015-01-04 00:00:00-05:00
  2015-01-05 00:00:00-05:00
  2015-01-03 00:00:00-05:00
  2015-01-04 00:00:00-05:00
  2015-01-05 00:00:00-05:00
  2015-01-06 00:00:00-05:00
  2015-01-07 00:00:00-05:00
  2015-01-03 00:00:00-05:00
  2015-01-04 00:00:00-05:00
  2015-01-05 00:00:00-05:00
  2015-01-06 00:00:00-05:00
  2015-01-07 00:00:00-05:00

Problem description

where fails on both Series and DataFrame when given values that have datetime-tz dtype. Both work fine with naive datetime values. Series ends up with a mix of datetime-tz and what appear to be i8 values; datetime-tz columns in DataFrames end up with the wrong shape (perhaps a concat along the wrong axis is occurring?).
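
Since naive datetimes are reported to work, one possible workaround on affected versions is to drop to naive UTC, apply `where`, and restore the zone afterwards. This is only a sketch under that assumption; it was not proposed in this thread, and the variable names are illustrative:

```python
import pandas as pd

tz = "America/New_York"
s1 = pd.Series(pd.date_range("20150101", "20150105", tz=tz), name="date")
s2 = pd.Series(pd.date_range("20150103", "20150107", tz=tz), name="date")

# tz_convert(None) converts to UTC and removes the tz, giving naive values.
naive1 = s1.dt.tz_convert(None)
naive2 = s2.dt.tz_convert(None)

# Apply where on the naive data, then reattach UTC and convert back.
result = (
    naive1.where(naive1 < naive1[3], naive2)
    .dt.tz_localize("UTC")
    .dt.tz_convert(tz)
)
print(result)
```

Both inputs must share the same time zone for the final conversion to be meaningful.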

Expected Output

Series:

0    2015-01-01 00:00:00-05:00
1    2015-01-02 00:00:00-05:00
2    2015-01-03 00:00:00-05:00
3    2015-01-06 00:00:00-05:00
4    2015-01-07 00:00:00-05:00
Name: date, dtype: datetime64[ns, America/New_York]

DataFrame:

                       date  x          y
0 2015-01-01 00:00:00-05:00  0 2015-01-01
1 2015-01-02 00:00:00-05:00  1 2015-01-02
2 2015-01-03 00:00:00-05:00  2 2015-01-03
3 2015-01-06 00:00:00-05:00  6 2015-01-06
4 2015-01-07 00:00:00-05:00  7 2015-01-07
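
The expected output can also be written as an executable check. The following is a sketch of such a regression check, assuming a pandas build that includes the fix (the issue is marked as fixed by #21660); it is not the test that was added to pandas:

```python
import numpy as np
import pandas as pd

dts1 = pd.date_range("20150101", "20150105", tz="America/New_York")
dts2 = pd.date_range("20150103", "20150107", tz="America/New_York")

# Series case: rows 3 and 4 should come from the second series.
ser1 = pd.Series(dts1, name="date")
ser2 = pd.Series(dts2, name="date")
ser_result = ser1.where(ser1 < ser1[3], ser2)
ser_expected = pd.Series(dts1[:3].append(dts2[3:]), name="date")
pd.testing.assert_series_equal(ser_result, ser_expected)

# DataFrame case: every column should take rows 3 and 4 from df2.
df1 = pd.DataFrame({"date": dts1, "x": np.arange(5), "y": dts1.tz_localize(None)})
df2 = pd.DataFrame({"date": dts2, "x": np.arange(3, 8), "y": dts2.tz_localize(None)})
mask = pd.DataFrame(True, index=df1.index, columns=df2.columns)
mask.iloc[3:] = False
df_result = df1.where(mask, df2)
df_expected = pd.concat([df1.iloc[:3], df2.iloc[3:]])
pd.testing.assert_frame_equal(df_result, df_expected)
```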

Output of pd.show_versions()

INSTALLED VERSIONS

commit: 2f4d393
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.0.dev0+113.g263386389
pytest: 3.6.0
pip: 10.0.1
setuptools: 39.2.0
Cython: 0.28.3
numpy: 1.14.2
scipy: 1.0.0
pyarrow: 0.8.0
xarray: 0.10.6
IPython: 6.4.0
sphinx: 1.7.5
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.5
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.8
pymysql: 0.8.1
psycopg2: None
jinja2: 2.10
s3fs: 0.1.5
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None

@mroeschke
Member

xrefing my #21469 (comment) and probably the root cause for #21469

@mroeschke mroeschke added Bug Timezones Timezone data dtype labels Jun 20, 2018
@Liam3851
Contributor Author

@mroeschke Do you have an idea why there's the weirdly shaped result from DataFrame.where with a tz column? I think you're more familiar with the internals here than I am.

@Liam3851
Contributor Author

Liam3851 commented Jun 20, 2018

@mroeschke I think I figured out at least the weirdly-shaped result issue here (not sure about the 1-D case). As indicated in your comment on #21469, the issue is in Block.where. The current implementation implicitly assumes that Block.values for a DataFrame is a 2-D ndarray. However, if the Block is DatetimeTZ-type, Block.values gives back a (1-D) DatetimeIndex with the correct TZ. This in turn is apparently confusing the where function into thinking broadcasting is required.
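
How mixing 1-D and 2-D inputs can silently change the result shape is easy to reproduce with plain NumPy. The snippet below is only an illustration of the broadcasting rule; the array shapes are chosen for the demonstration and are not taken from the actual code path inside Block.where:

```python
import numpy as np

values_1d = np.arange(5)         # shape (5,), like a 1-D block of values
other_1d = np.arange(10, 15)     # shape (5,)
cond_2d = np.array([[True], [True], [True], [False], [False]])  # shape (5, 1)

# A (5, 1) condition against (5,) operands broadcasts to (5, 5)
# instead of raising, so the result has the wrong shape.
mismatched = np.where(cond_2d, values_1d, other_1d)
print(mismatched.shape)

# With all three inputs in the same 2-D layout, the shape is preserved.
aligned = np.where(
    cond_2d.reshape(1, 5),
    values_1d.reshape(1, 5),
    other_1d.reshape(1, 5),
)
print(aligned.shape)
```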

@mroeschke
Member

Thanks for investigating @Liam3851.

IIRC DatetimeTZBlocks only contain data (.values) in 1D because tz may not be conserved across both axes in a 2D sense, e.g. one column has UTC data and another has US/Eastern data. This may be remedied once ExtensionArrays for DatetimeTZ data are implemented. Otherwise, this may need to be patched by defining or patching a where method for DatetimeTZBlocks.

@Liam3851
Contributor Author

Liam3851 commented Jun 26, 2018

@mroeschke Been digging into this a bit further. Do you know, what is the contract regarding the dimension of Block.values, Block.get_values, and Block.internal_values? It's not entirely clear from the pandas.core.internals source.

In particular, NonConsolidatableMixin.get_values has the comment:

''' need to to_dense myself (and always return a ndim sized object) '''

DatetimeTZBlock overrides NonConsolidatableMixin.get_values, and has 1-dimensional get_values/values (a DatetimeIndex), but for blocks inside a DataFrame, self.ndim==2. That seems to violate the comment above, but maybe that's more a description than a required contract for subclasses?

Similarly in DatetimeTZBlock._try_coerce_args, right now we have a call:

values = _block_shape(values.asi8, ndim=self.ndim)

which ensures that values is self.ndim (2) dimensions. However, if other is also a DatetimeIndex we just get the 1-dimensional values of other. Seems clear these should be consistent, just not sure whether in general _try_coerce_args should always return something

  1. the same dimension/type as self.values (internal), or
  2. the same dimension as self.ndim

Thoughts?
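
The inconsistency described above can be sketched outside pandas internals. In the snippet below, `to_block_shape` is a hypothetical stand-in for what `_block_shape` is described as doing (bringing an array up to `ndim` dimensions); it is not the pandas implementation, and `asi8` is used only to obtain the integer representation:

```python
import numpy as np
import pandas as pd


def to_block_shape(values, ndim):
    """Hypothetical helper: prepend length-1 axes until values has ndim dimensions."""
    values = np.asarray(values)
    while values.ndim < ndim:
        values = values.reshape((1,) + values.shape)
    return values


dti_self = pd.date_range("20150101", "20150105", tz="America/New_York")
dti_other = pd.date_range("20150103", "20150107", tz="America/New_York")

# values is reshaped to match self.ndim == 2 ...
values = to_block_shape(dti_self.asi8, ndim=2)
# ... while other, taken as-is, stays 1-D.
other_raw = dti_other.asi8
# Option 2 from the list: coerce other to the same ndim as well.
other = to_block_shape(other_raw, ndim=2)

print(values.shape, other_raw.shape, other.shape)
```

Under either option in the list, the point is that `values` and `other` should leave the coercion step with the same number of dimensions.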

@mroeschke
Member

@jreback architected the DatetimeTZBlock (and feel free to correct me), but from what I understand, DatetimeTZBlock holds its .values as a 1D array (i.e. DatetimeIndex) while other block types can hold their values in 1D or 2D arrays (numpy arrays mostly).

From what I understand at a high level, a DataFrame is composed of a group of Blocks that group the values of the DataFrame by dtype along the columns. In most cases these Blocks can store 2D values, so the collection (concatenation) of 2D values from the Blocks makes up the overall 2D shape of the DataFrame.

The reason why DatetimeTZBlock doesn't hold 2D arrays is that the timezone is assumed to be consistent along the columns and not necessarily the rows; therefore, you need 1 DatetimeTZBlock per DataFrame column with a Datetimetz dtype. Other block types AFAIK don't have similar limitations so 1 Block can hold values for multiple DataFrame columns.
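
The per-column nature of the time zone is visible from the public API alone. The following sketch (column names are illustrative, and it deliberately avoids the private block internals discussed here) shows that two tz-aware columns carry different dtypes, while two naive columns share one dtype and fit in a single 2D datetime64 array:

```python
import pandas as pd

idx = pd.date_range("2015-01-01", periods=3, tz="UTC")
df = pd.DataFrame({
    "utc": idx,
    "eastern": idx.tz_convert("America/New_York"),
    "naive_a": idx.tz_localize(None),
    "naive_b": idx.tz_localize(None) + pd.Timedelta(days=1),
})
print(df.dtypes)

# Naive columns share a dtype, so together they form a 2D datetime64 array.
naive_2d = df[["naive_a", "naive_b"]].to_numpy()
# tz-aware columns with different zones have no common datetime64 dtype,
# so a 2D view falls back to an object array of Timestamps.
aware_2d = df[["utc", "eastern"]].to_numpy()
print(naive_2d.dtype, aware_2d.dtype)
```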

Beyond that, I am not familiar with how the internals make sure the dimensions line up, or with routines that would coerce DatetimeTZBlock.values from 1D to 2D (and if there were any, they would probably strip the tz information).

@jreback jreback added this to the 0.24.0 milestone Jun 26, 2018
@jreback
Contributor

jreback commented Jun 26, 2018

@mroeschke expl is correct. yeah it's currently a bummer that we aren't super consistent internally in how we handle naive datetimes (2D) vs tz-aware (1D), but that's the direction we have been moving. So various compat things exist to handle this.

We basically have too many accessors that do slightly different things, but the cleanup is non-trivial / hard. For this issue, prob just need to see where it's getting coerced to 2D (or vice-versa) and maybe override the method in DatetimeTZBlock or put in something that can handle it generically.

here's a patch which works on the above example. doesn't seem to break any series tests, but have a go. (breaks something in dataframe tests :<)

(pandas) bash-3.2$ git diff
diff --git a/pandas/core/internals.py b/pandas/core/internals.py
index fe508dc1b..f2b732b57 100644
--- a/pandas/core/internals.py
+++ b/pandas/core/internals.py
@@ -1495,7 +1495,7 @@ class Block(PandasObject):
                 return values
 
             values, values_mask, other, other_mask = self._try_coerce_args(
-                values, other)
+                values, orig_other)
 
             try:
                 return self._try_coerce_result(expressions.where(
