
astype attempts to convert datetime64[ms] as nanoseconds when missing value in data #10746

Closed
dschallis opened this issue Aug 4, 2015 · 10 comments
Labels
API Design · Datetime (Datetime data dtype) · Dtype Conversions (Unexpected or buggy dtype conversions)
Comments

@dschallis

When creating a DataFrame with a millisecond timestamp, created with dtype='datetime64[ms]', this works as expected when there are no missing values in the data:

import pandas as pd
df = pd.DataFrame([1036713600000], dtype='float64')
print(df[0].astype('datetime64[ms]'))

Output:

0   2002-11-08
Name: 0, dtype: datetime64[ns]

Adding a missing value to the data causes the values to get parsed as nanoseconds rather than milliseconds, which causes an exception:

df = pd.DataFrame([1036713600000, None], dtype='float64')
print(df[0].astype('datetime64[ms]'))

Output:

Traceback (most recent call last):                                          
  File "./f.py", line 6, in <module>                                        
    print(df[0].astype('datetime64[ms]'))                                   
  File "/Users/dsc/.virtualenvs/p3default/lib/python3.4/site-packages/pandas/core/generic.py", line 2411, in astype
    dtype=dtype, copy=copy, raise_on_error=raise_on_error, **kwargs)        
  File "/Users/dsc/.virtualenvs/p3default/lib/python3.4/site-packages/pandas/core/internals.py", line 2504, in astype
    return self.apply('astype', dtype=dtype, **kwargs)                      
  File "/Users/dsc/.virtualenvs/p3default/lib/python3.4/site-packages/pandas/core/internals.py", line 2459, in apply
    applied = getattr(b, f)(**kwargs)                                       
  File "/Users/dsc/.virtualenvs/p3default/lib/python3.4/site-packages/pandas/core/internals.py", line 373, in astype
    values=values, **kwargs)                                                
  File "/Users/dsc/.virtualenvs/p3default/lib/python3.4/site-packages/pandas/core/internals.py", line 407, in _astype
    fastpath=True, dtype=dtype, klass=klass)                                
  File "/Users/dsc/.virtualenvs/p3default/lib/python3.4/site-packages/pandas/core/internals.py", line 2101, in make_block
    placement=placement)                                                    
  File "/Users/dsc/.virtualenvs/p3default/lib/python3.4/site-packages/pandas/core/internals.py", line 1795, in __init__
    values = tslib.cast_to_nanoseconds(values)                              
  File "pandas/tslib.pyx", line 2622, in pandas.tslib.cast_to_nanoseconds (pandas/tslib.c:43295)
  File "pandas/tslib.pyx", line 1333, in pandas.tslib._check_dts_bounds (pandas/tslib.c:23332)
pandas.tslib.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 292278994-08-16 16:47:04

Expected output was something like:

0   2002-11-08
1   NaT
Name: 0, dtype: datetime64[ns]

This was using python 3.4.2, Pandas 0.16.2.

@jreback
Contributor

jreback commented Aug 4, 2015

I guess this is a bug, but in general you should simply use pd.to_datetime:

In [7]: pd.to_datetime(df[0],unit='ms')
Out[7]: 
0   2002-11-08
1          NaT
Name: 0, dtype: datetime64[ns]

actually I think the astype should simply raise on anything that isn't quite simple.

@jreback jreback added the Datetime (Datetime data dtype), Difficulty Novice, API Design, and Dtype Conversions (Unexpected or buggy dtype conversions) labels Aug 4, 2015
@jreback jreback added this to the Next Major Release milestone Aug 4, 2015
@jreback
Contributor

jreback commented Aug 4, 2015

pull-requests are welcome!

@dschallis
Author

Thanks @jreback, I'll use pd.to_datetime for now, and see if I can track down the cause of the error when I get a chance.

@jreback
Contributor

jreback commented Aug 4, 2015

@dschallis gr8! As an FYI, you almost always want to use int64 and not float with timestamps.
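To make the int64-vs-float point concrete (a minimal illustration, not from the thread): float64 has a 53-bit significand, so nanosecond-scale epoch integers (around 10^18) cannot be represented exactly and are silently rounded when they pass through a float column:

```python
ts = 1360287003083988472       # nanosecond epoch timestamp as an exact int
roundtrip = int(float(ts))     # force a pass through float64

print(roundtrip)               # 1360287003083988480 -- low digits rounded away
print(roundtrip == ts)         # False
```

Values this size sit between 2**60 and 2**61, where adjacent float64 values are 256 apart, so the last few digits cannot survive the round trip.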

@dschallis
Author

@jreback Yup, unfortunately I'm reading from a JSON dataset with some missing values for timestamps, so pd.read_json is coercing the timestamp to float64.
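For that situation, a minimal sketch of the suggested workaround (assuming the column comes back as float64 with NaN for the missing entries, as described above): read the JSON without a datetime dtype hint, then convert explicitly with pd.to_datetime, which maps NaN to NaT instead of raising:

```python
import io
import pandas as pd

# timestamp field omitted when missing, so the column is float64 with NaN
data = '[{"id": 1, "timestamp": 1036713600000}, {"id": 2}]'
df = pd.read_json(io.StringIO(data), orient='records', convert_dates=False)

# explicit conversion: NaN becomes NaT instead of raising OutOfBoundsDatetime
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
print(df)
```

convert_dates=False is passed here only to keep read_json's own date inference out of the picture so the conversion step is deterministic.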

@jreback
Contributor

jreback commented Aug 4, 2015

http://pandas.pydata.org/pandas-docs/stable/io.html#reading-json

are you passing date_unit ?

how does the JSON represent missing values?

@dschallis
Author

@jreback hmm, no, I wasn't using date_unit. My data and code looks something like:

# timestamp field omitted when missing:
data = '''
[{"id": 1, "timestamp": 1036713600000},
 {"id": 2}]
'''

df = pd.read_json(data, orient='records', dtype={'id': 'int32', 'timestamp': 'datetime64[ms]'})

print(df); print(df.info()) produces:

   id     timestamp
0   1  1.036714e+12
1   2           NaN

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
id           2 non-null int32
timestamp    1 non-null float64
dtypes: float64(1), int32(1)
memory usage: 40.0 bytes

@jreback
Contributor

jreback commented Aug 4, 2015

hmm, I think that should work actually (if you provided date_unit). So sort of a two-fer here.

@dschallis
Author

@jreback Interestingly enough, that seems to have the same underlying issue as my original example did, i.e.:

data = '[{"id": 1, "timestamp": 1360287003083988472}, {"id": 2}]'
print(pd.read_json(data, orient='records', dtype={'id': 'int32', 'timestamp': 'datetime64[ns]'}, date_unit='ns'))

Correctly outputs:

   id                     timestamp
0   1 2013-02-08 01:30:03.083988480
1   2                           NaT

Changing to 'ms' instead of 'ns':

data = '[{"id": 1, "timestamp": 1360287003083}, {"id": 2}]'
print(pd.read_json(data, orient='records', dtype={'id': 'int32', 'timestamp': 'datetime64[ms]'}, date_unit='ms'))

Output:

   id     timestamp
0   1  1.360287e+12
1   2           NaN

As before though, the timestamp gets correctly converted for milliseconds only if there are no missing values in the data (with missing data, only nanosecond conversion works).
I.e. if I change to data = '[{"id": 1, "timestamp": 1360287003083}]', the output for the above is:

   id               timestamp
0   1 2013-02-08 01:30:03.083

I'll take a look into the root cause if I get a chance tomorrow though!

@jreback
Contributor

jreback commented Aug 15, 2015

Was closed by #10776.

@jreback jreback closed this as completed Aug 15, 2015
@jreback jreback modified the milestones: 0.17.0, Next Major Release Aug 15, 2015