Stripped timedelta[s] series represented as timedelta[ns] #12425

Closed
toobaz opened this Issue Feb 23, 2016 · 12 comments

@toobaz
Member

toobaz commented Feb 23, 2016

In [2]: s = pd.Series(range(61)).astype('timedelta64[s]')

In [3]: s.tail(1)
Out[3]: 
60   00:01:00
dtype: timedelta64[s]

In [4]: str(s).splitlines()[-2]
Out[4]: '60   00:00:00.000000'

... because the truncated representation interprets the content of the series as nanoseconds rather than seconds. The same happens with timedelta64[ms] and probably with any other resolution (if there are any).
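A minimal NumPy-only sketch of the suspected mechanism, assuming the truncated repr ends up reading the raw int64 payload as nanoseconds instead of converting units. The variable names are illustrative and this is not the pandas code path itself:

```python
import numpy as np

# One minute stored at second resolution: the int64 payload is 60.
seconds = np.array([60], dtype="timedelta64[s]")

# Reinterpreting the raw payload as nanoseconds (no conversion) gives 60 ns,
# which rounds to zero when displayed at microsecond precision.
misread = seconds.view("int64").view("timedelta64[ns]")

# A real unit conversion scales the payload by 10**9 instead.
converted = seconds.astype("timedelta64[ns]")

print(misread[0], converted[0])
```

The two results differ by a factor of 10**9, which would match the `00:00:00.000000` shown for row 60 above.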

In [5]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: 404819358e90da57c8025a259ab58cd75426069f
python: 3.4.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.3.0-1-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: it_IT.utf8

pandas: 0.18.0rc1+35.g4048193
nose: 1.3.6
pip: 1.5.6
setuptools: 20.1.1
Cython: 0.23.2
numpy: 1.10.0.post2
scipy: 0.16.0
statsmodels: 0.8.0.dev0+755fa81
xarray: None
IPython: 4.1.1
sphinx: 1.3.1
patsy: 0.3.0-dev
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.3
matplotlib: 1.5.dev1
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: 0.7.3
lxml: None
bs4: 4.4.0
html5lib: 0.999
httplib2: 0.9.1
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.8
@jorisvandenbossche

Member

jorisvandenbossche commented Feb 23, 2016

The same example, but shown in another way:

In [27]: s = pd.Series(range(5)).astype('timedelta64[s]')

In [28]: s
Out[28]:
0   00:00:00
1   00:00:01
2   00:00:02
3   00:00:03
4   00:00:04
dtype: timedelta64[s]

In [29]: pd.options.display.max_rows = 4

In [30]: s
Out[30]:
0          00:00:00
1   00:00:00.000000
          ...
3   00:00:00.000000
4   00:00:00.000000
dtype: timedelta64[ns]
@jorisvandenbossche

Member

jorisvandenbossche commented Feb 23, 2016

Possibly related to #11594

@jorisvandenbossche

Member

jorisvandenbossche commented Feb 23, 2016

The problem actually lies in (the usage of) concat:

In [38]: pd.concat([s[0:2], s[-2:]])
Out[38]:
0          00:00:00
1   00:00:00.000000
3   00:00:00.000000
4   00:00:00.000000
dtype: timedelta64[ns]
@jreback

Contributor

jreback commented Feb 23, 2016

well this is a different issue.

.astype('timedelta64[s]') is interpreting the timedelta64[s] units as ns, so a conversion needs to be done in `core.common._possibly_convert_to_datetimelike` or maybe in astype itself. So I'll mark this bug as one like that. Note that this feature is not technically supported.

For example, we reject things like the following (so we can either reject, or better yet accept both). We already do the latter for datetime64: we accept multiple dtypes (and then convert).

In [1]: Series(np.arange(61),dtype='m8[s]')
TypeError: cannot convert timedeltalike to dtype [timedelta64[s]]

Separately, there is the bug @jorisvandenbossche notes, #11594, which is different (and not a problem with .concat): it is about preserving the dtype on slicing. I'll put an example there.
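The "accept and then convert" option could look roughly like the sketch below. `coerce_timedelta_to_ns` is a hypothetical helper written with plain NumPy for illustration. It is not an existing pandas function and not the fix that was eventually committed:

```python
import numpy as np


def coerce_timedelta_to_ns(values):
    """Hypothetical helper: return timedelta64 data at nanosecond resolution."""
    arr = np.asarray(values)
    if arr.dtype.kind != "m":
        raise TypeError(f"expected timedelta64 data, got {arr.dtype}")
    # astype does a real unit conversion (seconds -> ns multiplies by 10**9),
    # unlike a view, which would only relabel the payload.
    return arr.astype("timedelta64[ns]")


secs = np.arange(61).astype("timedelta64[s]")
ns = coerce_timedelta_to_ns(secs)
print(ns.dtype, ns[-1])
```

With a step like this in place, any timedelta64 unit would be accepted on input while the stored data stays at nanosecond resolution.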

@jreback jreback added this to the Next Major Release milestone Feb 23, 2016

@toobaz

Member

toobaz commented Feb 23, 2016

@jreback : ... but there is no problem in the .tail() (or .iloc[-1], actually) version of the same Series. How can it be the fault of .astype('timedelta64[s]')?!

@jreback

Contributor

jreback commented Feb 23, 2016

These are not stored correctly. We only store as timedelta64[ns]

In [1]: s = pd.Series(range(61)).astype('timedelta64[s]')

In [2]: s.values
Out[2]: 
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60], dtype='timedelta64[s]')

In [3]: Series(pd.to_timedelta(range(61),unit='s')).values
Out[3]: 
array([          0,  1000000000,  2000000000,  3000000000,  4000000000,
        5000000000,  6000000000,  7000000000,  8000000000,  9000000000,
       10000000000, 11000000000, 12000000000, 13000000000, 14000000000,
       15000000000, 16000000000, 17000000000, 18000000000, 19000000000,
       20000000000, 21000000000, 22000000000, 23000000000, 24000000000,
       25000000000, 26000000000, 27000000000, 28000000000, 29000000000,
       30000000000, 31000000000, 32000000000, 33000000000, 34000000000,
       35000000000, 36000000000, 37000000000, 38000000000, 39000000000,
       40000000000, 41000000000, 42000000000, 43000000000, 44000000000,
       45000000000, 46000000000, 47000000000, 48000000000, 49000000000,
       50000000000, 51000000000, 52000000000, 53000000000, 54000000000,
       55000000000, 56000000000, 57000000000, 58000000000, 59000000000,
       60000000000], dtype='timedelta64[ns]')
@toobaz

Member

toobaz commented Feb 23, 2016

OK... but the ordinary __repr__ is aware of this, so it behaves fine. If the snipped version also behaved well, and we just stated "all timedelta64[*] are stored as nanoseconds", wouldn't everybody be happy?

(that is: am I missing an official definition of timedelta64[s] which we should comply with?)

@jreback

Contributor

jreback commented Feb 23, 2016

No, the repr is wrong, which indicates that the internal representation is wrong. This needs to be fixed in the astyping. Putting something in the docs is the very last thing to do. It is a bug and should be fixed (though technically this is not supported, but probably should be).

@toobaz

Member

toobaz commented Feb 24, 2016

Does the following (on datetime rather than timedelta) also reflect a bug?

In [2]: str(pd.Series(range(100)).astype('datetime64[ms]')).splitlines()[-1]
Out[2]: 'dtype: datetime64[ns]'
@jreback

Contributor

jreback commented Feb 24, 2016

why are you doing this stringifying thing? that doesn't make any sense.

yes, this is properly converted to datetime64[ns]

In [10]: pd.Series(range(100)).astype('datetime64[ms]').dt.microsecond.head()
Out[10]: 
0       0
1    1000
2    2000
3    3000
4    4000
dtype: int64
In [13]: pd.to_datetime(pd.Series(range(100)), unit='ms').dt.microsecond.head()
Out[13]: 
0       0
1    1000
2    2000
3    3000
4    4000
dtype: int64

Note: there is no millisecond attribute because it is not compatible with datetime.
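For reference, a small standard-library sketch of that compatibility point: `datetime.datetime` exposes `microsecond` but has no `millisecond` attribute, so a 3 ms offset shows up as 3000 microseconds. The variable names are illustrative:

```python
from datetime import datetime, timedelta

# 3 milliseconds past the epoch, built with the standard library only.
stamp = datetime(1970, 1, 1) + timedelta(milliseconds=3)

# datetime stores sub-second precision in a single microsecond field.
has_millisecond = hasattr(stamp, "millisecond")

print(stamp.microsecond, has_millisecond)
```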

@jorisvandenbossche

Member

jorisvandenbossche commented Feb 24, 2016

@jreback The stringifying is another way to show what the repr looks like. And this does make sense in e.g. the case of timedelta, as the dtype in the repr did not correspond to the actual dtype of the series.

But in the case of datetime64, these values are stored correctly (so @toobaz, what you showed in your last comment is not a bug, but the correct behaviour). When doing astype('datetime64[s]'), the values are correctly interpreted as seconds, and subsequently converted to datetime64[ns] to store them in the series:


In [22]: pd.Series(range(5)).astype('datetime64[s]')
Out[22]:
0   1970-01-01 00:00:00
1   1970-01-01 00:00:01
2   1970-01-01 00:00:02
3   1970-01-01 00:00:03
4   1970-01-01 00:00:04
dtype: datetime64[ns]

In [23]: pd.Series(range(5)).astype('datetime64[s]').values
Out[23]:
array(['1970-01-01T01:00:00.000000000+0100',
       '1970-01-01T01:00:01.000000000+0100',
       '1970-01-01T01:00:02.000000000+0100',
       '1970-01-01T01:00:03.000000000+0100',
       '1970-01-01T01:00:04.000000000+0100'], dtype='datetime64[ns]')

Probably the same should happen for astype('timedelta64[s]')?

In [29]: pd.Series(range(5)).astype('timedelta64[s]')
Out[29]:
0   00:00:00
1   00:00:01
2   00:00:02
3   00:00:03
4   00:00:04
dtype: timedelta64[s]

In [30]: pd.Series(range(5)).astype('timedelta64[s]').values
Out[30]: array([0, 1, 2, 3, 4], dtype='timedelta64[s]')
@jreback

Contributor

jreback commented Feb 24, 2016

@jorisvandenbossche exactly, that's what I noted above. datetime64 values are all coerced pretty well. That can be used as a model for how/where to do a similar coercion (which is already done in the constructor and most places, but obviously not in astyping).
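A NumPy-only sketch of using the datetime64 behaviour as a model, assuming the goal is "interpret the integers in the requested unit, then store as ns". The names are illustrative and this runs outside pandas:

```python
import numpy as np

raw = np.arange(5, dtype="int64")

# datetime64: integers are read as seconds since the epoch, then stored as ns.
stamps = raw.astype("datetime64[s]").astype("datetime64[ns]")

# The analogous two-step coercion for timedelta64.
deltas = raw.astype("timedelta64[s]").astype("timedelta64[ns]")

print(stamps.dtype, deltas.dtype)
print(deltas.view("int64"))
```

In both cases the stored int64 payload is the original value times 10**9, which is what the datetime64 output above already shows.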

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Jan 13, 2018

jreback added a commit to jreback/pandas that referenced this issue Jan 13, 2018

jreback added a commit to jreback/pandas that referenced this issue Jan 13, 2018

jreback added a commit that referenced this issue Jan 13, 2018
