Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: to_datetime inconsistent output container type #28629

Open
fjetter opened this issue Sep 26, 2019 · 7 comments
Open

BUG: to_datetime inconsistent output container type #28629

fjetter opened this issue Sep 26, 2019 · 7 comments
Labels
API - Consistency Internal Consistency of API/Behavior Bug Categorical Categorical Data Type Timeseries

Comments

@fjetter
Copy link
Member

fjetter commented Sep 26, 2019

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>> type(pd.to_datetime(pd.Categorical(['2019-09-24 17:00:00']*50)))
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
>>> type(pd.to_datetime(pd.Categorical(['2019-09-24 17:00:00']*51)))
<class 'pandas.core.indexes.category.CategoricalIndex'>

Problem description

The output container type is inconsistent for pd.to_datetime and flips once the series is above 50 rows. This is an issue since the CategoricalIndex class does not support DateTime-like operations (tz_localize, stftime, etc.).
Since the behaviour switches between differently sized Series objects this is particularly troublesome since code behaves quite different depending on the amount of data we feed it (e.g. unit test vs. production).

Expected Output

Consistent container type.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.9.184-linuxkit
machine : x86_64
processor :
byteorder : little
LC_ALL : C.UTF-8
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.17.0
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.2
setuptools : 39.1.0
Cython : None
pytest : 5.1.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.0.5
lxml.etree : 4.4.1
html5lib : None
pymysql : None
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.7.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 2.5.4
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
s3fs : None
scipy : None
sqlalchemy : 1.3.7
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : 1.0.5

@dongho-jung
Copy link
Contributor

I tried like this

@@ -215,7 +215,7 @@ def _convert_and_box_cache(
     """
     from pandas import Series

-    result = Series(arg).map(cache_array)
+    result = Series(arg).map(cache_array).astype('datetime64')
     if box:
         return _box_as_indexlike(result, utc=None, name=name)
     return result.values

and I'm on the testing 25%

@dongho-jung
Copy link
Contributor

the branch of those two cases is pandas/core/tools/datetimes.py:786 and because you define index as Categorical, dt_array won't be converted as DatetimeIndex. However, this way isn't what we want.
So I cast the argument to 'datetime64' before it is passed into _box_as_indexlike.

def _box_as_indexlike(
dt_array: ArrayLike, utc: Optional[bool] = None, name: Optional[str] = None
) -> Union[ABCIndex, ABCDatetimeIndex]:
"""
Properly boxes the ndarray of datetimes to DatetimeIndex
if it is possible or to generic Index instead
Parameters
----------
dt_array: 1-d array
array of datetimes to be boxed
tz : object
None or 'utc'
name : string, default None
Name for a resulting index
Returns
-------
result : datetime of converted dates
- DatetimeIndex if convertible to sole datetime64 type
- general Index otherwise
"""
from pandas import DatetimeIndex, Index
if is_datetime64_dtype(dt_array):
tz = "utc" if utc else None
return DatetimeIndex(dt_array, tz=tz, name=name)
return Index(dt_array, name=name)

@dongho-jung
Copy link
Contributor

nevermind! it spews out a lot of errors

@mroeschke
Copy link
Member

You hit this condition:

if len(arg) <= 50:

In addition to cache=True becoming the default in v0.25. This behavior definitely looks buggy nonetheless.

@dsm054
Copy link
Contributor

dsm054 commented Nov 30, 2019

I just hit a variant of the same something's-weird-with-the-cache logic bug. The real-world case was a lot less obviously silly than this, but:

In [191]: s = pd.Series([np.nan] * 1 + [pd.NaT]*50, dtype=object)                                                                          

In [192]: s.index                                                                                                                          
Out[192]: RangeIndex(start=0, stop=51, step=1)

In [193]: s.index.is_unique                                                                                                                
Out[193]: True

In [194]: pd.to_datetime(s)                                                                                                                
---------------------------------------------------------------------------
InvalidIndexError                         Traceback (most recent call last)
<ipython-input-194-db78efd4cda4> in <module>
----> 1 pd.to_datetime(s)
[...]
InvalidIndexError: Reindexing only valid with uniquely valued Index objects

where a total length of 50 works fine. As expected, setting cache=False makes things work again.

@simonjayhawkins simonjayhawkins added the Categorical Categorical Data Type label Jun 9, 2020
@jbrockmendel jbrockmendel added the API - Consistency Internal Consistency of API/Behavior label Sep 21, 2020
@lukemanley
Copy link
Member

Any thoughts on which is correct behavior here: Categorical -> Categorical (maintain container type) or Categorical -> DatetimeIndex?

@mroeschke
Copy link
Member

I would say returning DatetimeIndex is the more-correct behavior, but I could be convinced otherwise

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Bug Categorical Categorical Data Type Timeseries
Projects
None yet
Development

No branches or pull requests

7 participants