ENH: enable pd.cut to handle i8 convertibles #14714

Closed
jreback opened this Issue Nov 22, 2016 · 15 comments

2 participants
Contributor

jreback commented Nov 22, 2016 edited

This should work for timedeltas AND datetimes.

Should be straightforward: detect an i8 convertible, turn it into i8, do the cut, then turn the result back into the original dtype.

In [15]: s = Series(pd.to_timedelta(np.random.randint(0,100,size=10),unit='ms')).sort_values()

In [16]: s
Out[16]:
3   00:00:00.005000
5   00:00:00.007000
7   00:00:00.010000
4   00:00:00.017000
9   00:00:00.023000
1   00:00:00.043000
0   00:00:00.045000
6   00:00:00.047000
8   00:00:00.065000
2   00:00:00.090000
dtype: timedelta64[ns]

In [18]: pd.cut(s, 5)
TypeError: unsupported operand type(s) for +: 'Timedelta' and 'float'

# works when converted
In [17]: pd.cut(s.astype('timedelta64[ms]'), 5)
Out[17]:
3    (4.915, 22]
5    (4.915, 22]
7    (4.915, 22]
4    (4.915, 22]
9       (22, 39]
1       (39, 56]
0       (39, 56]
6       (39, 56]
8       (56, 73]
2       (73, 90]
dtype: category
Categories (5, object): [(4.915, 22] < (22, 39] < (39, 56] < (56, 73] < (73, 90]]

jreback added this to the Next Major Release milestone Nov 22, 2016

jreback changed the title from ENH: enable pd.cut to handle timedeltas to ENH: enable pd.cut to handle i8 convertibles Nov 22, 2016

jreback added the Timeseries label Nov 22, 2016

Contributor

aileronajay commented Nov 23, 2016

@jreback in the above example should the return type of the final object be 'timedelta64'? (the datatype of the original input)

Contributor

jreback commented Nov 23, 2016

yes (it should be original dtype)

Contributor

aileronajay commented Nov 23, 2016

@jreback Does it currently return strings? I got the output below by printing the type of each member of the returned category object:
Categories (5, object): [(2.915, 20] < (20, 37] < (37, 54] < (54, 71] < (71, 88]]
(2.915, 20] <type 'str'>
(2.915, 20] <type 'str'>
(37, 54] <type 'str'>
(37, 54] <type 'str'>
(37, 54] <type 'str'>
(37, 54] <type 'str'>
(54, 71] <type 'str'>
(54, 71] <type 'str'>
(71, 88] <type 'str'>
(71, 88] <type 'str'>

Contributor

jreback commented Nov 23, 2016

yes it returns strings
we don't have an interval type currently

Contributor

aileronajay commented Nov 23, 2016

@jreback I am a bit confused now. As part of this enhancement we first need to convert to a dtype which cut can handle (timedelta64[ms]) and then return the type we originally started from (timedelta64[ns]). So the objects returned will still be strings, but strings composed of the type that we initially passed to cut (timedelta64[ns])?

Contributor

jreback commented Nov 23, 2016

yes this is a bit tricky. I think you:

  • convert to i8
  • do the binning
  • construct the labels based on the bins / dtype
  • stringify them
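
The four steps above can be sketched roughly as follows. This is only an illustration, not the change that was eventually merged: `cut_timedelta_sketch` is a hypothetical helper name, and it assumes a timedelta Series and an integer bin count.

```python
import numpy as np
import pandas as pd


def cut_timedelta_sketch(s, n_bins):
    """Hypothetical helper (not pandas API): bin a timedelta Series."""
    # 1. convert to i8: reinterpret the values as int64 nanosecond counts
    i8 = s.astype("timedelta64[ns]").values.view("i8")

    # 2. do the binning on plain integers; labels=False returns bin codes
    codes, edges = pd.cut(i8, n_bins, labels=False, retbins=True)

    # 3. construct the labels from the bin edges, back in the original dtype
    td_edges = pd.to_timedelta(np.round(edges).astype("i8"), unit="ns")

    # 4. stringify them, mimicking the "(left, right]" style of pd.cut
    labels = [
        "({}, {}]".format(left, right)
        for left, right in zip(td_edges[:-1], td_edges[1:])
    ]
    cat = pd.Categorical.from_codes(codes, categories=labels, ordered=True)
    return pd.Series(cat, index=s.index)


# The same millisecond values as in the opening example of this issue.
s = pd.Series(pd.to_timedelta([5, 7, 10, 17, 23, 43, 45, 47, 65, 90], unit="ms"))
result = cut_timedelta_sketch(s, 5)
print(result)
```

The bin assignment matches the millisecond example at the top of the issue; only the labels differ, since they are now built from timedeltas.
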
Contributor

aileronajay commented Nov 23, 2016

@jreback thanks, this is what i was thinking

Contributor

jreback commented Nov 23, 2016

IF we had an Interval type then this would be very easy (#8625), e.g. Period is an interval type for datetimes (but not actually implemented that way, and has slightly different semantics).

Contributor

aileronajay commented Nov 23, 2016

@jreback Is there an error in the way I am making this round trip?

I start with s, which is timedelta64[ns] (the first block of output below). I then convert it using the astype conversion used earlier in this thread, and convert back to timedelta64[ns] with another astype conversion. But this round trip does not retain the data: r holds only zero durations, with no time information.

import numpy as np
import pandas as pd

s = pd.Series(pd.to_timedelta(np.random.randint(0, 100, size=10), unit='ms')).sort_values()
print(s)
p = s.astype('timedelta64[ms]')
r = p.astype(s.dtype)
print(r)

python data_conversion.py
3 00:00:00.008000
5 00:00:00.015000
8 00:00:00.031000
1 00:00:00.040000
9 00:00:00.045000
6 00:00:00.046000
2 00:00:00.072000
0 00:00:00.082000
4 00:00:00.091000
7 00:00:00.091000
dtype: timedelta64[ns]
3 00:00:00.000000
5 00:00:00.000000
8 00:00:00.000000
1 00:00:00.000000
9 00:00:00.000000
6 00:00:00.000000
2 00:00:00.000000
0 00:00:00.000000
4 00:00:00.000000
7 00:00:00.000000
dtype: timedelta64[ns]

Contributor

jreback commented Nov 23, 2016

no, you will always convert to ns and always back from ns. Internally you will do

.values.view('i8')

to get the integers, then pd.to_timedelta(result, unit='ns') to convert back.
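
A minimal sketch of that round trip, using a small hand-written Series (the sample values are made up for illustration):

```python
import pandas as pd

# Force nanosecond resolution so the integers below are nanosecond counts.
s = pd.Series(pd.to_timedelta([8, 15, 31, 40], unit="ms")).astype("timedelta64[ns]")

# Reinterpret the underlying timedelta64[ns] buffer as int64.
as_i8 = s.values.view("i8")

# Convert the integers back, stating the unit explicitly.
restored = pd.Series(pd.to_timedelta(as_i8, unit="ns"), index=s.index)

print(as_i8)
print((restored == s).all())
```

Unlike the astype round trip above, no information is lost here, because the integers are always interpreted as nanoseconds in both directions.
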

Contributor

aileronajay commented Nov 24, 2016 edited

@jreback does this code block emulate the change we want to make?

import re

import numpy as np
import pandas as pd

s = pd.Series(pd.to_timedelta(np.random.randint(0, 100, size=10), unit='ms')).sort_values()
print(s)
p = s.astype('timedelta64[ms]')
r = pd.cut(p, 5)
# each element of r is a string label such as '(4.915, 22]'
for elem in r:
    k = elem.split(',')
    # strip the brackets, keeping only the numeric bin edges
    a = re.sub('[^0-9.]', '', k[0])
    b = re.sub('[^0-9.]', '', k[1])
    # reinterpret the edges as millisecond timedeltas
    print(pd.to_timedelta(float(a), unit='ms'), pd.to_timedelta(float(b), unit='ms'))

output

1 00:00:00.012000
6 00:00:00.025000
7 00:00:00.042000
9 00:00:00.043000
3 00:00:00.057000
0 00:00:00.061000
2 00:00:00.071000
5 00:00:00.083000
8 00:00:00.086000
4 00:00:00.099000
dtype: timedelta64[ns]
0 days 00:00:00.011913 0 days 00:00:00.029400
0 days 00:00:00.011913 0 days 00:00:00.029400
0 days 00:00:00.029400 0 days 00:00:00.046800
0 days 00:00:00.029400 0 days 00:00:00.046800
0 days 00:00:00.046800 0 days 00:00:00.064200
0 days 00:00:00.046800 0 days 00:00:00.064200
0 days 00:00:00.064200 0 days 00:00:00.081600
0 days 00:00:00.081600 0 days 00:00:00.099000
0 days 00:00:00.081600 0 days 00:00:00.099000
0 days 00:00:00.081600 0 days 00:00:00.099000

Contributor

aileronajay commented Nov 24, 2016

@jreback Should calling s.astype('timedelta64[ns]') on a series s of pandas.Timedelta objects return a series of numpy timedelta64 objects with time in seconds? Currently s.astype('timedelta64[ns]') returns pandas.Timedelta objects instead of the numpy equivalent.

Contributor

jreback commented Nov 24, 2016

we always return pandas objects (for timedelta / datetime)

Contributor

aileronajay commented Nov 25, 2016

@jreback Is it a good idea to use infer_dtype (from pandas.lib) to test whether the object we are calling cut on is of type datetime or timedelta, in order to decide whether we need a conversion to timedelta64 or datetime64?

Contributor

jreback commented Nov 25, 2016

no

use needs_i8_conversion
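
For illustration, a dtype-based check in the spirit of needs_i8_conversion might look like the sketch below. It is an approximation built only from public pandas.api.types functions: `looks_i8_convertible` is a hypothetical name, and it covers only datetime64 and timedelta64 data, not everything the internal helper handles.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_timedelta64_dtype


def looks_i8_convertible(values):
    """Hypothetical stand-in: True for datetime64 / timedelta64 data."""
    # Checks the dtype only; it never inspects the individual elements,
    # which is the key difference from an infer_dtype-style scan.
    return is_datetime64_any_dtype(values) or is_timedelta64_dtype(values)


td = pd.Series(pd.to_timedelta([1, 2, 3], unit="ms"))
dt = pd.Series(pd.to_datetime(["2016-11-22", "2016-11-23"]))
ints = pd.Series([1, 2, 3])

print(looks_i8_convertible(td), looks_i8_convertible(dt), looks_i8_convertible(ints))
```
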

jreback added a commit that referenced this issue Dec 22, 2016

aileronajay + jreback: ENH: allowing datetime and timedelta datatype in pd cut bins
xref #14714, follow-on to #14737

Author: Ajay Saxena <aileronajay@gmail.com>

Closes #14798 from aileronajay/cut_timetype_bin and squashes the following commits:

82bffa1 [Ajay Saxena] added method for time type bins in pd cut and modified tests
ac919cf [Ajay Saxena] added test for datetime bin type
355e569 [Ajay Saxena]  allowing datetime and timedelta datatype in pd cut bins
3e4f839

ShaharBental added a commit to ShaharBental/pandas that referenced this issue Dec 26, 2016

aileronajay + ShaharBental: ENH: allowing datetime and timedelta datatype in pd cut bins
xref #14714, follow-on to #14737

Author: Ajay Saxena <aileronajay@gmail.com>

Closes #14798 from aileronajay/cut_timetype_bin and squashes the following commits:

82bffa1 [Ajay Saxena] added method for time type bins in pd cut and modified tests
ac919cf [Ajay Saxena] added test for datetime bin type
355e569 [Ajay Saxena]  allowing datetime and timedelta datatype in pd cut bins
4c75674