Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/API: to_timedelta unit-argument ignored for string input #12136

Closed
belteshassar opened this issue Jan 25, 2016 · 26 comments · Fixed by #34634
Closed

BUG/API: to_timedelta unit-argument ignored for string input #12136

belteshassar opened this issue Jan 25, 2016 · 26 comments · Fixed by #34634
Labels
Bug Timedelta Timedelta data type
Milestone

Comments

@belteshassar
Copy link

So, I don't know if this is by design, but it sure confused me. to_timedelta() operates differently on the number 1000 compared to the string '1000':

In[3]: pd.to_timedelta(1000, unit='ms')
Out[3]: Timedelta('0 days 00:00:01')
In[4]: pd.to_timedelta(1000.0, unit='ms')
Out[4]: Timedelta('0 days 00:00:01')

and

In[6]: pd.to_timedelta('1000ms')
Out[6]: Timedelta('0 days 00:00:01')

while

In[5]: pd.to_timedelta('1000', unit='ms')
Out[5]: Timedelta('0 days 00:00:00.000001')

I've tried my best to come up with a reason for why this behaviour would be desirable, but I can't think of any.

@jreback
Copy link
Contributor

jreback commented Jan 25, 2016

this is a bug. The unit argument only matters when you don't have units for a particular entry.
However when parsing as a string IIRC its just ignored, so that should be handled (this is in the cython code)

separately, I think we might want an error if you have mixed with unit specifiers and unit is specified e.g.

In [2]: pd.to_timedelta(['1000ms',1000])
Out[2]: TimedeltaIndex(['00:00:01', '00:00:00.000001'], dtype='timedelta64[ns]', freq=None)

In [3]: pd.to_timedelta(['1000ms',1000],unit='ms')
Out[3]: TimedeltaIndex(['00:00:01', '00:00:01'], dtype='timedelta64[ns]', freq=None)

the first has a default of unit='ns', so maybe want to make the default None (and you can still
default it if its all integer like now).

want to do a pull-request?

@jreback jreback added this to the Next Major Release milestone Jan 25, 2016
@belteshassar
Copy link
Author

want to do a pull-request?

I don't promise anything, but I can take a look at the code base and see if I can get my head around what needs to be done.

@jreback
Copy link
Contributor

jreback commented Jan 26, 2016

gr8. contributing docs are here

post any questions on this issue.

duncanwp added a commit to duncanwp/pandas that referenced this issue Feb 17, 2016
@belteshassar
Copy link
Author

I finally got around to looking at this. I noticed @duncanwp made some commits targeting this issue. Is that enough to solve it?

@jreback
Copy link
Contributor

jreback commented Mar 11, 2016

@belteshassar you can certainly start with that. No pull-request yet.

@duncanwp
Copy link

No, sorry. I thought about passing the unit string into parse_timedelta_string deep down on tslib.pyx explicitly and made a bit of progress, but it's not finished.

Feel free to pick it up or disregard as you like, I don't think I'll have time to finish it.

@belteshassar
Copy link
Author

While working on this, I've found another issue that I'll try to fix with the same patch.

Edit: Was a bit more complicated than I thought. Posted it as #12691 instead.

@jreback
Copy link
Contributor

jreback commented Mar 21, 2016

I think you can go ahead and add, I don't think there is a reason. Further, you can reorg the tests a bit: see here, and make several more test functions (as the current one is getting long).

The real diff for month,years is they are fixed-length units (because they are not relative to things like offsets), so while accepted are not that useful.

@belteshassar
Copy link
Author

separately, I think we might want an error if you have mixed with unit specifiers and unit is specified e.g.

I've been thinking a bit about this and would like some more input. As I see it, we have at least five cases that could appear:

  1. to_timedelta('1234', unit='ms'): This one is straight forward and what initially started the issue.
  2. to_timedelta('1234ms', unit='ms'): Here the unit matches the unit in the string, this should be discouraged, but does it really need to raise an error when there is no ambiguity?
  3. to_timedelta('1234us', unit='ms'): Now this is becoming ambiguous. One could argue that this should raise ValueError, but on the other hand, one could decide to use the unit parameter as a suggestion and ignore it for all strings with included unit qualifiers.
  4. to_timedelta('01:02:03', unit='ms'): Here we have a recognized format that is incompatible with the unit specifed.
  5. to_timedelta('01:02', unit='m'): Here we have a recognized format that is possibly compatible with the unit specifed.

I lean against either raising ValueError for all non-numeric strings if unit is specified or simply ignoring unit if the string cannot be cast to float.

@jreback
Copy link
Contributor

jreback commented Mar 23, 2016

I would accept 1), raise on the rest.
if you have a unit arg, easiest is to pd.to_numeric(x) which will coerce to integer/float (and raise if not). Then you are done.

@belteshassar
Copy link
Author

Neat. I was unaware of pd.to_numeric(). Certainly makes this much easier. However it only works for list-like arguments.

@jreback
Copy link
Contributor

jreback commented Mar 23, 2016

ok, you could enhance to_numeric to accept a scalar as well (just need another in the if, and define a function to make this happen, see to_timedelta for a sketch on how to do this). & tests!

@belteshassar
Copy link
Author

Sounds good. That was my thought as well. I suppose the Timedelta constructor should mirror the behaviour of to_timedelta for scalar args so I guess I should go ahead and fix that as well. The TimedeltaIndex constructor already calls to_timedelta if unit is specified so that needs no special attention.

...and tests, indeed.

@jreback
Copy link
Contributor

jreback commented Apr 26, 2016

@belteshassar if you want to come back to this would be great!

@belteshassar
Copy link
Author

I might be able to get back to it during next week. Or I guess I could push what I have so far if someone else wants to take over.

@Sup3rGeo
Copy link
Contributor

Sup3rGeo commented Mar 28, 2018

We should not raise if Timedelta('12s', unit='ms') because documentation is explicit that the unit parameter:

Denote the unit of the input, if input is an integer. Default ‘ns’.

So, only "if" is an integer (actually number, more accurately considering the implementation). And in fact the current behavior (we don't want to break backwards compatibility) is for the above timedelta to return 12 seconds, not raise.

So for strings with pure numbers I would suggest just trying to convert the string parameter to float first. If not, then just proceed with the unchanged string.

try:
    value = float(value)
except ValueError:
    pass
(... function as usual, using value ...)

If anyone else agrees I could make a PR for this.

@jreback
Copy link
Contributor

jreback commented Mar 28, 2018

@Sup3rGeo no this should raise as it is a specification error (IOW raise a ValueError). We don't want to guess here, the user specified that the integer is in a certain unit. If its not a number this is a problem.

@Sup3rGeo
Copy link
Contributor

Sup3rGeo commented Mar 28, 2018

@jreback to make things clearer, I think we might be talking about two different things here.

  1. Whether we should accept things like Timedelta('10 us', unit='s').
  • In your first post on 23 Mar, about the 5 possibilities @belteshassar presented, you mentioned it should raise (possibilities 2 and 3).
  • But the current implementation gives a timedelta with '10us' from that, ignoring unit parameter. And is also aligned with documentation that says that unit parameter is valid only for int (although it also work for float), so it is ignored because we have a string. It does not seem to be a big deal to have documented priorities when we have two possibly conflicting options.
  • Changing the behavior to raise would break backwards compatibility.
  • Therefore I believe we should keep current behaviour, not raising. But we could issue a warning so user is aware of what is happening.

2 - How to make Timedelta('10', unit='s') work (what this issue is about)

  • In your first post on 23 Mar, about the 5 possibilities @belteshassar presented, you mentioned it should be allowed (possibility 1).
  • This would break specification as per the documentation, as we are now making the unit parameter work for things other than ints and floats - in this case, a string.
  • But my opinion is that we can relax the specification so that unit parameter works for numbers in general, be it ints, floats or anything that can be converter to floats. In this case the string "10" would fulfill that condition.
  • Now, assuming then this is the desired behavior, we could implement just by using the input value casted to float (this is the test that we have a number). If not we just use the input value unchanged.

Please let me know if I misunderstood anything we are discussing.

@Sup3rGeo
Copy link
Contributor

Also related to #16791

@TomAugspurger
Copy link
Contributor

@Sup3rGeo I think the docs are too literal about arg being an integer. We can accept integers or strings that can be coerced to integers. We want to treat Timedelta(10, unit='s') and Timedelta('10', unit='s') the same.

Changing Timedelta('10s', unit='us') would be labeled as a bug fix, not an API change.

@Sup3rGeo
Copy link
Contributor

@TomAugspurger, I share the same opinion but I understood that @jreback was against it.

Also, when you say changing Timedelta('10s', unit='us') you mean so that it raises?

@jorisvandenbossche
Copy link
Member

We want to treat Timedelta(10, unit='s') and Timedelta('10', unit='s') the same.

Is it needed that we treat strings parsed as integers? (in contrast to strings with a unit specification inside the string like Timedelta('10s'))
Personally I would just raise an error for those cases.

@jorisvandenbossche
Copy link
Member

However, if we do this consistently we should probably also disallow pd.to_timedelta('10')? (which is currently working)

@jorisvandenbossche jorisvandenbossche changed the title unit-argument of to_timedelta() has no effect when arg is string BUG/API: to_timedelta unit-argument ignored for string input Nov 21, 2018
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Nov 21, 2018

Copying here the examples of the duplicate issues I closed:

From #23716 (@jzwinck):

In : pd.to_timedelta(123456789, unit='s')
Out: Timedelta('1428 days 21:33:09') # ok

In : pd.to_timedelta('123456789', unit='s')
Out: Timedelta('0 days 00:00:00.123456') # oops!

In : pd.to_timedelta('123.45', unit='s')
Out: ValueError('no units specified') # ???

If a string is passed that is a proper integer, it is interpreted as nanoseconds regardless the unit kwarg. In addition, when it is a "float as string" there is an error, but a float itself works with unit.

From #16791 (@jtf621):

In [8]: pd.to_timedelta('1', unit='h')
Out[8]: Timedelta('0 days 00:00:00.000000')

@jorisvandenbossche
Copy link
Member

So the fact that the keyword is ignored is for sure a bug we should fix.

The question for me is then: should we actually allow strings to be interpreted as integers? The docs clearly say that unit is only for integer input (and in practice also for floats), and we could simply follow that (if you want a string to be interpreted as a number, the user can always convert it to a number themselves)
However, @chris-b1 pointed out in #16791 (comment) that to_datetime allows number-like strings to be interpreted with a unit (although not for unit='ns', that's probably a bug). So consistency between those two could also be good.

If we do allow it, I would in any case propose to require a unit, and not silently assume 'ns' (similar proposal for to_datetime about that is here: #15836)

@jorisvandenbossche
Copy link
Member

Ah, there is the PR #23025 working on this that is actually doing this unit=None to require a unit when a string is passed, forgot about that :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment