
ENH? to_datetime: ability to infer unit for numeric values #17860

Closed
cailiang9 opened this issue Oct 13, 2017 · 8 comments
Labels
Enhancement · Needs Discussion · Timeseries

Comments

@cailiang9

I implemented a simple function for reference:

import pandas as pd

def infer_as_datetime(dtarr):
    # for strings like 'YYYY-MM-dd hh:mm:ss.msec', unit='ns' succeeds on the first pass
    for unit in ['ns', 'us', 'ms', 's']:
        # tail values have a higher chance of being recent, i.e. not in 1970
        dt = pd.to_datetime(dtarr[-3:], unit=unit)
        if (dt.year > 1970).any():
            return pd.to_datetime(dtarr, unit=unit)
    return pd.to_datetime(dtarr, unit='D')

print(infer_as_datetime([0, 365, 366]))
print(infer_as_datetime(['1971-12-30 12:34:01.314', '2015-01-30 22']))
print(infer_as_datetime([1442315569.315]))
print(infer_as_datetime([1442315569000]))
print(infer_as_datetime([1442315569000000]))
print(infer_as_datetime([1442315569000000000]))

Expected Output

Output of infer_as_datetime(dtarr)

DatetimeIndex(['1970-01-01', '1971-01-01', '1971-01-02'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['1971-12-30 12:34:01.314000', '2015-01-30 22:00:00'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2015-09-15 11:12:49.315000'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2015-09-15 11:12:49'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2015-09-15 11:12:49'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2015-09-15 11:12:49'], dtype='datetime64[ns]', freq=None)

@jorisvandenbossche
Member

@cailiang9 Can you give a bit more explanation of what this issue is about?

@jorisvandenbossche added the Needs Info label Oct 13, 2017
@cailiang9
Author

pd.to_datetime(infer=True) does not work smartly: it cannot automatically infer the unit, so we have to try unit='s' for 1442315569.315 or unit='ms' for 1442315569315.
The infer_as_datetime function I wrote does not require specifying anything like unit. In most cases, a returned dt.year == 1970 means the unit is incorrect, and infer_as_datetime uses that as a cue. Since the semantics of infer=True are already about guessing, we can guess the unit for convenience as well. Surprisingly, infer_as_datetime also supports many kinds of formats, such as 'YYYY-MM-dd hh:mm:ss'.
I hope pd.to_datetime(infer=True) can work like infer_as_datetime, so that when an array of datetime strings or unix timestamps comes in, we can simply call pd.to_datetime(infer=True) to convert it to datetimes without specifying anything.
On the other hand, if a user feels this way is unsafe, they just need to drop infer and specify more concrete parameters.

@jorisvandenbossche
Member

I suppose you are referring to the infer_datetime_format keyword (there is no infer keyword)? That keyword is meant to infer the format of strings:

infer_datetime_format : boolean, default False
If True and no format is given, attempt to infer the format of the
datetime strings, and if it can be inferred, switch to a faster
method of parsing them. In some cases this can increase the parsing
speed by ~5-10x.

And it deliberately does not handle inferring the unit for numeric values.
Doing that is much less 'clear', as a numeric value can be valid for different units (which is most of the time not the case for string formats):

In [20]: pd.to_datetime(1442315569.315)
Out[20]: Timestamp('1970-01-01 00:00:01.442315569')

In [21]: pd.to_datetime(1442315569.315, unit='ms')
Out[21]: Timestamp('1970-01-17 16:38:35.569315')

In [22]: pd.to_datetime(1442315569.315, unit='s')
Out[22]: Timestamp('2015-09-15 11:12:49.315000')

The above results are all valid.
I agree that the unit which does not produce year 1970 is in many cases the probable one (as you noted), but I am not sure we can use that logic in the to_datetime function. Personally I think it is better to ask users to explicitly specify the unit.

@jorisvandenbossche changed the title from "improve infer function in to_datetime function" to "ENH? to_datetime: ability to infer unit for numeric values" Oct 16, 2017
@jorisvandenbossche added the Timeseries and Needs Discussion labels and removed the Needs Info label Oct 16, 2017
@jreback
Contributor

jreback commented Oct 16, 2017

You can in general only do this without failure when the input overflows: start with a lower-precision unit (s) and proceed to a higher one (ns); we do something like this in read_json. But as a general rule I would be -1 on this.
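
A minimal sketch of that overflow-based approach (the helper name is hypothetical, and this is not the actual read_json code): try the coarsest unit first and fall back to a finer one whenever the conversion overflows the datetime64[ns] range.

import pandas as pd
from pandas.errors import OutOfBoundsDatetime

def to_datetime_overflow_fallback(values):
    # A nanosecond-magnitude value read as seconds overflows the
    # datetime64[ns] range (max ~ year 2262), so the first unit that
    # converts without overflowing is the best guess.
    for unit in ['s', 'ms', 'us', 'ns']:
        try:
            return pd.to_datetime(values, unit=unit)
        except OutOfBoundsDatetime:
            continue
    raise ValueError("no unit fits within the datetime64[ns] range")

As noted, this only disambiguates inputs that actually overflow for the wrong unit; a value like 1442315569.315 converts without error as seconds, milliseconds, or microseconds, so the sketch simply returns the coarsest unit that fits.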

@cailiang9
Author

@jorisvandenbossche The infer_datetime_format keyword would just be an addition. Alternatively, the function could print the selected unit together with a warning telling the user to specify the unit explicitly.
Users can always explicitly specify the unit if they want.
In practice, when faced with numeric values, people also have no idea which unit to use and have to try them one by one, so inferring the unit would be very helpful.
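
A minimal sketch of that warning idea, combined with the year-1970 heuristic from the top of the thread (the wrapper name is mine, not a pandas API; it assumes array-like numeric input):

import warnings
import pandas as pd

def to_datetime_infer_unit(values):
    for unit in ['ns', 'us', 'ms', 's']:
        dt = pd.to_datetime(values, unit=unit)
        if (dt.year > 1970).any():
            # report the guessed unit so the user can pin it down explicitly
            warnings.warn("inferred unit=%r; pass unit=%r explicitly "
                          "to silence this warning" % (unit, unit))
            return dt
    return pd.to_datetime(values, unit='D')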

@shuaitang5

shuaitang5 commented Sep 9, 2018

I know this is an old ticket, but since it's still open, I would like to add something here.

I recently ran into a problem where the dataframe contains a column of epoch timestamp strings. Most of them are 13-digit integers (1536507914000, in ms), but there are a few cases that are 16-digit integers (1536507914123000, in us). In other words, when I explicitly pass unit='ms', the rows that have 16-digit epoch timestamps come back as NaT, which is not what I want later on.

>>> df = pd.DataFrame({
...     'ts': [1536507914000, 1536507915000, 1536507916000, 1536507917123000, 1536507918000],
...     'val': [1.2, 2.01, 2.0, 3.9, 1.9]})
>>> df
                 ts   val
0     1536507914000  1.20
1     1536507915000  2.01
2     1536507916000  2.00
3  1536507917123000  3.90
4     1536507918000  1.90
>>> pd.to_datetime(df['ts'], infer_datetime_format=True, errors='coerce')
0   1970-01-01 00:25:36.507914
1   1970-01-01 00:25:36.507915
2   1970-01-01 00:25:36.507916
3   1970-01-18 18:48:27.917123
4   1970-01-01 00:25:36.507918
Name: ts, dtype: datetime64[ns]
>>> pd.to_datetime(df['ts'], unit='ms', errors='coerce')
0   2018-09-09 15:45:14
1   2018-09-09 15:45:15
2   2018-09-09 15:45:16
3                   NaT
4   2018-09-09 15:45:18
Name: ts, dtype: datetime64[ns]
>>> pd.to_datetime(df['ts'], unit='us', errors='coerce')
0   1970-01-18 18:48:27.914
1   1970-01-18 18:48:27.915
2   1970-01-18 18:48:27.916
3   2018-09-09 15:45:17.123
4   1970-01-18 18:48:27.918
Name: ts, dtype: datetime64[ns]

The data I'm working with has many millions of records, so it was hard to spot the few hundred records with 16-digit epoch timestamps; I only found the problem today by accident.

So I agree with @cailiang9 that auto-inferring the unit of timestamps would be very helpful, especially for epoch timestamps. @jorisvandenbossche Could we add some sort of length check when parsing epoch timestamps, so that pd.to_datetime automatically infers the unit based on length?
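
A minimal sketch of such a length check (the helper name and the digit cutoffs are my assumptions, not pandas behavior): infer a unit per row from the digit count and normalize everything to nanoseconds before a single to_datetime call.

import pandas as pd

def to_datetime_by_digits(ts):
    # rough cutoffs: ~10 digits -> s, ~13 -> ms, ~16 -> us, ~19 -> ns
    values = ts.astype('int64')
    digits = values.abs().astype(str).str.len()
    factor = pd.Series(1, index=ts.index, dtype='int64')  # already ns
    factor[digits <= 11] = 10**9                          # seconds
    factor[(digits > 11) & (digits <= 14)] = 10**6        # milliseconds
    factor[(digits > 14) & (digits <= 17)] = 10**3        # microseconds
    return pd.to_datetime(values * factor, unit='ns')

With the frame above, to_datetime_by_digits(df['ts']) places all five rows on 2018-09-09 instead of coercing row 3 to NaT.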

@jbrockmendel
Member

xref #15836 looks very similar.

@jbrockmendel
Member

This was discussed in yesterday's dev call; there was consensus that we don't want to be guessing here. Closing.
