Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Sparse[datetime64[ns]] TypeError: data type not understood #35762

Closed
2 of 3 tasks
sbrugman opened this issue Aug 17, 2020 · 5 comments · Fixed by #35838
Closed
2 of 3 tasks

BUG: Sparse[datetime64[ns]] TypeError: data type not understood #35762

sbrugman opened this issue Aug 17, 2020 · 5 comments · Fixed by #35838
Labels
Milestone

Comments

@sbrugman
Copy link
Contributor

  • I have checked that this issue has not already been reported (related, but different).

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import numpy as np
import pandas as pd
# Taken from the example here: https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#sparsedtype
dtype = pd.SparseDtype(np.dtype('datetime64[ns]'))

# Some value
values = [np.datetime64('2012-05-01T01:00:00.000000'), np.datetime64('2016-05-01T01:00:00.000000')]

# Create the series
series = pd.Series(values, dtype=dtype)

Alternatively:

series = pd.Series(values, dtype="Sparse[datetime64[ns]]")

Problem description

As a user I would expect that datetime64[ns] is supported as SparseDtype for the SparseArray based on the Sparse data structures page in the documentation. This is desireable for the same rationale as supporting other sparse types, since date(time)s can also contain mostly NaT values that the users wants to store efficiently.

Running the code above yields the following TypeError:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-52-9fc6a02721e3> in <module>
----> 1 series = pd.Series(values, dtype=dtype)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    325                     data = data.copy()
    326             else:
--> 327                 data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True)
    328 
    329                 data = SingleBlockManager.from_array(data, index)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\construction.py in sanitize_array(data, index, dtype, copy, raise_cast_failure)
    439     elif isinstance(data, (list, tuple)) and len(data) > 0:
    440         if dtype is not None:
--> 441             subarr = _try_cast(data, dtype, copy, raise_cast_failure)
    442         else:
    443             subarr = maybe_convert_platform(data)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\construction.py in _try_cast(arr, dtype, copy, raise_cast_failure)
    551             subarr = arr
    552         else:
--> 553             subarr = maybe_cast_to_datetime(arr, dtype)
    554 
    555         # Take care in creating object arrays (but iterators are not

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in maybe_cast_to_datetime(value, dtype, errors)
   1327                 # pandas supports dtype whose granularity is less than [ns]
   1328                 # e.g., [ps], [fs], [as]
-> 1329                 if dtype <= np.dtype("M8[ns]"):
   1330                     if dtype.name == "datetime64":
   1331                         raise ValueError(msg)

TypeError: data type not understood

Expected Output

The expected output is for Sparse[datetime[ns]] similar to what one would expect from other dtypes:

>>> series.values
[2012-05-01 01:00:00, 2018-05-01 01:00:00]
 Fill: NaT
 IntIndex
 Indices: array([0, 1])

>>> series.dtype
Sparse[datetime64[ns], NaT]

Workaround

Note that the following workaround partially gives the expected results:

series = pd.Series(pd.arrays.SparseArray(values))

For which these operations work as expected:

>>> series.values
[2012-05-01 01:00:00, 2018-05-01 01:00:00]
 Fill: NaT
 IntIndex
 Indices: array([0, 1])

>>> series.dtype
Sparse[datetime64[ns], NaT]

However series.head() will throw an error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

C:\ProgramData\Anaconda3\lib\site-packages\IPython\lib\pretty.py in pretty(self, obj)
    400                         if cls is not object \
    401                                 and callable(cls.__dict__.get('__repr__')):
--> 402                             return _repr_pprint(obj, self, cycle)
    403 
    404             return _default_pprint(obj, self, cycle)

C:\ProgramData\Anaconda3\lib\site-packages\IPython\lib\pretty.py in _repr_pprint(obj, p, cycle)
    695     """A pprint that just redirects to the normal repr function."""
    696     # Find newlines and replace them with p.break_()
--> 697     output = repr(obj)
    698     for idx,output_line in enumerate(output.splitlines()):
    699         if idx:

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in __repr__(self)
   1319             min_rows=min_rows,
   1320             max_rows=max_rows,
-> 1321             length=show_dimensions,
   1322         )
   1323         result = buf.getvalue()

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in to_string(self, buf, na_rep, float_format, header, index, length, dtype, name, max_rows, min_rows)
   1384             max_rows=max_rows,
   1385         )
-> 1386         result = formatter.to_string()
   1387 
   1388         # catch contract violations

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\formats\format.py in to_string(self)
    356 
    357         fmt_index, have_header = self._get_formatted_index()
--> 358         fmt_values = self._get_formatted_values()
    359 
    360         if self.truncate_v:

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\formats\format.py in _get_formatted_values(self)
    345             None,
    346             float_format=self.float_format,
--> 347             na_rep=self.na_rep,
    348         )
    349 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\formats\format.py in format_array(values, formatter, float_format, na_rep, digits, space, justify, decimal, leading_space, quoting)
   1177     )
   1178 
-> 1179     return fmt_obj.get_result()
   1180 
   1181 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\formats\format.py in get_result(self)
   1208 
   1209     def get_result(self) -> List[str]:
-> 1210         fmt_values = self._format_strings()
   1211         return _make_fixed_width(fmt_values, self.justify)
   1212 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\formats\format.py in _format_strings(self)
   1465 
   1466         if not isinstance(values, DatetimeIndex):
-> 1467             values = DatetimeIndex(values)
   1468 
   1469         if self.formatter is not None and callable(self.formatter):

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\datetimes.py in __new__(cls, data, freq, tz, normalize, closed, ambiguous, dayfirst, yearfirst, dtype, copy, name)
    277             dayfirst=dayfirst,
    278             yearfirst=yearfirst,
--> 279             ambiguous=ambiguous,
    280         )
    281 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\arrays\datetimes.py in _from_sequence(cls, data, dtype, copy, tz, freq, dayfirst, yearfirst, ambiguous)
    321             dayfirst=dayfirst,
    322             yearfirst=yearfirst,
--> 323             ambiguous=ambiguous,
    324         )
    325 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\arrays\datetimes.py in sequence_to_dt64ns(data, dtype, copy, tz, dayfirst, yearfirst, ambiguous)
   1957     if is_datetime64tz_dtype(data_dtype):
   1958         # DatetimeArray -> ndarray
-> 1959         tz = _maybe_infer_tz(tz, data.tz)
   1960         result = data._data
   1961 

AttributeError: 'SparseArray' object has no attribute 'tz'

Expected:

0   2012-05-01 01:00:00
1   2018-05-01 01:00:00
dtype: Sparse[datetime64[ns]]

Output of pd.show_versions()

INSTALLED VERSIONS

commit : d9fff27
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.1.0
numpy : 1.19.1
pytz : 2019.3
dateutil : 2.8.0
pip : 19.3.1
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.4.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.2
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : 1.2.1
fsspec : 0.5.2
fastparquet : None
gcsfs : None
matplotlib : 3.2.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.14.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.10
tables : 3.6.0
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.45.1

@sbrugman sbrugman added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 17, 2020
@jreback
Copy link
Contributor

jreback commented Aug 17, 2020

you are welcome to add support
currently this dtype for sparse has almost 0 support / tests

@jreback jreback added Enhancement Sparse Sparse Data Type Timeseries and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 17, 2020
@jreback jreback added this to the Contributions Welcome milestone Aug 17, 2020
@dsaxton
Copy link
Member

dsaxton commented Aug 21, 2020

The second error triggered after the workaround looks to be a regression caused by #33400.

cc @jbrockmendel

@simonjayhawkins
Copy link
Member

However series.head() will throw an error:

the error is not just for head(), any call to __repr__ probably raises

>>> pd.__version__
'1.2.0.dev0+128.gdb6414fbe'
>>>
>>> values = [np.datetime64('2012-05-01T01:00:00.000000'), np.datetime64('2016-05-01T01:00:00.000000')]
>>>
>>> series = pd.Series(pd.arrays.SparseArray(values))
>>> series
Traceback (most recent call last):
...
AttributeError: 'SparseArray' object has no attribute 'tz'
>>>
>>> series.values
[2012-05-01 01:00:00, 2016-05-01 01:00:00]
Fill: NaT
IntIndex
Indices: array([0, 1])

>>>
>>> series.dtype
Sparse[datetime64[ns], NaT]
>>>
>>> series.head()
Traceback (most recent call last):
...
AttributeError: 'SparseArray' object has no attribute 'tz'
>>>

The second error triggered after the workaround looks to be a regression caused by #33400.

this error persists with the current changes in #35838

>>> pd.__version__
'1.2.0.dev0+121.gf4e1debe9'
>>>
>>> # Taken from the example here: https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#sparsedtype
>>> dtype = pd.SparseDtype(np.dtype('datetime64[ns]'))
>>>
>>> # Some value
>>> values = [np.datetime64('2012-05-01T01:00:00.000000'), np.datetime64('2016-05-01T01:00:00.000000')]
>>>
>>> # Create the series
>>> series = pd.Series(values, dtype=dtype)
>>>
>>> series
Traceback (most recent call last):
...
AttributeError: 'SparseArray' object has no attribute 'tz'
>>>

@dsaxton I think we should maybe raise a separate issue for this (and mark as regression). wdyt?

@dsaxton
Copy link
Member

dsaxton commented Aug 21, 2020

@simonjayhawkins I think so too, since it's a different bug than the first error

@simonjayhawkins
Copy link
Member

@simonjayhawkins I think so too, since it's a different bug than the first error

issue raised. xref #35843 Thanks @dsaxton

@simonjayhawkins simonjayhawkins modified the milestones: Contributions Welcome, 1.1.2 Aug 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants