JSON native support for datetime encoding #4498

Merged: 3 commits merged into pandas-dev:master on Aug 15, 2013

Conversation

@Komnomnomnom (Contributor)

This adds support at the C level for encoding datetime values in JSON. I added a new parameter, date_unit, which controls the precision of the encoding; the options are seconds, milliseconds, microseconds, and nanoseconds. It defaults to milliseconds, which should fix #4362.

I also added support for NaTs and tried to detect when numpy arrays of datetime64 longs are passed.

Note that datetime decoding is still handled at the Python level, but I added a date_unit param here too so timestamps can be converted correctly.
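
As a quick illustration of the new parameter (a hedged sketch of the API described above, using an assumed sample value):

import pandas as pd

# date_unit picks the epoch precision written out; 'ms' is the default.
s = pd.Series([pd.Timestamp('2013-08-15 12:00:00.123456789')])
print(s.to_json(date_unit='s'))   # stamps truncated to whole seconds
print(s.to_json(date_unit='ns'))  # full nanosecond precision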

valgrind looks happy with the changes, and all tests pass OK for me on Python 2.7 and 3.3.

@Komnomnomnom (Contributor, Author)

Build is failing on Python 2.6 and 3.2; I'll try to get my hands on 2.6 and take a look.

@jreback (Contributor) commented Aug 7, 2013

  • release notes
  • can you update the examples in io.rst for using date_unit (or I can do it ...up 2 u)
  • does the user need to specify the date_unit on deserialization as well?

@Komnomnomnom (Contributor, Author)

Ok

  • numpy 1.6 fixes
  • release notes
  • doc examples for date_unit

does the user need to specify the date_unit on deserialization as well?

Yes, date parsing won't work properly otherwise; I added an example to the docs about this.
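
For instance (a hedged sketch in the spirit of that docs example, not copied from it):

import pandas as pd

# The reader has to be told which unit the epoch stamps are in;
# ms values parsed as seconds land far outside any plausible date range.
df = pd.DataFrame({'date': [pd.Timestamp('2013-08-15')]})
payload = df.to_json(date_unit='ms')
round_trip = pd.read_json(payload, date_unit='ms')  # dates survive intact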

Rebased it all to current master. Want me to squash any of these commits?

@jreback (Contributor) commented Aug 12, 2013

look good...

I guess it's 'impossible' to figure out if the user specified an 'invalid' date_unit, though if it's out of range of valid dates then you could...?

e.g. if I wrote a date in ns, most reads back in s should fail

@Komnomnomnom (Contributor, Author)

I guess we could try to sniff the values and deduce the relative magnitude, or alternatively do something like

from pandas import to_datetime

# Try units from coarsest to finest: a stamp stored in a finer unit
# overflows the datetime64[ns] range when parsed as a coarser one,
# so the first unit that converts cleanly wins.
for unit in ('s', 'ms', 'us', 'ns'):
    try:
        to_datetime(date_data, unit=unit)
        break
    except OverflowError:
        pass

which would only happen if the user passed date_unit=None....

@jreback (Contributor) commented Aug 12, 2013

ok....that makes sense....default will be 'ms' anyhow, correct?

@Komnomnomnom (Contributor, Author)

Default will be ms when serialising, but with the above change I'd be inclined to make None the default when deserialising?

@Komnomnomnom (Contributor, Author)

Ok I've added timestamp precision detection when deserialising. It seems to work pretty well so I've set the default to date_unit=None. Let me know if you guys think a different default would be more appropriate.

Inline review on the diff (context truncated in the original):

    raise ValueError('date_unit must be one of %s' %
                     (self.stamp_units,))
else:
    self.min_stamp *= (next((pow(1e3, i)
@jreback (Contributor):

Shouldn't the min_stamp just be in a class-level dictionary? Then look up the date_unit, applying the default when date_unit is None; also lower-case the date_unit (in case you are passed something else).
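
A minimal sketch of what that could look like (class name, attribute names, and cutoff values here are illustrative, not the actual pandas internals):

class Parser(object):
    # Minimum plausible epoch stamp per unit (roughly one year past
    # the epoch), held at class level so __init__ is just a lookup.
    _STAMP_UNITS = ('s', 'ms', 'us', 'ns')
    _MIN_STAMPS = {
        's':  31536000,
        'ms': 31536000 * 10**3,
        'us': 31536000 * 10**6,
        'ns': 31536000 * 10**9,
    }

    def __init__(self, date_unit=None):
        if date_unit is not None:
            date_unit = date_unit.lower()  # tolerate 'MS', 'Ns', ...
            if date_unit not in self._STAMP_UNITS:
                raise ValueError('date_unit must be one of %s'
                                 % (self._STAMP_UNITS,))
            self.min_stamp = self._MIN_STAMPS[date_unit]
        else:
            # date_unit=None means "detect": start from the coarsest unit.
            self.min_stamp = self._MIN_STAMPS['s']
        self.date_unit = date_unit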

@Komnomnomnom (Contributor, Author):

Ok done

@jreback (Contributor) commented Aug 12, 2013

@Komnomnomnom just a couple of minor points....this is pretty awesome though!

squash down to a couple of commits and fix those minor points, and we can merge

@Komnomnomnom (Contributor, Author)

Ok, added a commit with the fixes; I'll wait to hear back before squashing, so it's easier to review the latest changes.

@jreback (Contributor) commented Aug 12, 2013

ok...one more thing....can you include a small JSON file from 0.12 (generate it with 0.12) and put in a compat check, just to make sure that you interpret it correctly? (I think it works, just making sure we don't break the existing behaviour)...

@Komnomnomnom (Contributor, Author)

Good call @jreback! This exposed a few things:

1) The NaT handling in 0.12 was not correct; this has already been addressed by the previous changes, but it means I wasn't able to include NaTs in my compat test:

0.12

In [81]: s = pd.Series([pd.NaT])

In [82]: s.to_json()
Out[82]: '{"0":-9223372036854775808}'

This PR

In [7]: s = pd.Series([pd.NaT])

In [8]: s.to_json()
Out[8]: '{"0":null}'

Added a line to the release notes about this (bug fix).

2) This is an odd one. I don't fully understand why, but if a date column including NaTs was serialised to timestamps, there appears to be slightly different handling in to_datetime during deserialisation and the expected OverflowError is suppressed. I fixed this by adding except -1, which allows the exception to be raised from the guilty Cython func (see the sketch after this list):

cdef inline int64_t cast_from_unit(object unit, object ts) except -1:

All pandas tests still pass for me with this change, but I am not that familiar with Cython or this code, so....

3) Had to modify json.py a bit more to handle deserialising NaTs in date columns properly (they'll be deserialised to NaN by the native code); changed the check for iNaT to a call to isnull.

4) I removed the casting to int64 when attempting the to_datetime call; it doesn't seem to be necessary?
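
A minimal Cython sketch of the except -1 contract from point 2 (illustrative, not the pandas function): a cdef function returning a plain C integer has no channel to report a raised Python exception, so declaring an error return value is what lets the OverflowError propagate instead of being swallowed.

# "except -1" marks -1 as a sentinel that may signal a raised Python
# exception, so callers check for it and propagate the error.
# Assumes factor > 0; purely for illustration.
cdef inline long long scale_to_nanos(long long value, long long factor) except -1:
    if value > 9223372036854775807 // factor:
        raise OverflowError('scaled value does not fit in int64 nanoseconds')
    return value * factor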

@jreback (Contributor) commented Aug 14, 2013

I think in the iNaT testing call you still need to check if any values == iNaT, because that value is only considered a null if the series is of dtype M (e.g. a datetime) at that point; in this case it's an int64 (isnull will use the view to check, but I don't recall if the dtype is set at this point in the code)
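
A small runnable illustration of that point (a hedged sketch, not the code under review): the iNaT sentinel only reads as missing once the values carry a datetime64 dtype.

import numpy as np
import pandas as pd

iNaT = np.iinfo(np.int64).min  # pandas' int64 sentinel for NaT

ints = np.array([iNaT], dtype='int64')
print(pd.isnull(pd.Series(ints)))        # False: just a very negative int

stamps = ints.view('datetime64[ns]')     # same bits, dtype M8[ns]
print(pd.isnull(pd.Series(stamps)))      # True: the sentinel now reads as NaT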

@Komnomnomnom (Contributor, Author)

Sweet, that makes it compatible with the NaT handling from 0.12. It still won't be able to parse NaTs from 0.12 serialised with iso format, though:

In [187]: s = pd.Series([pd.NaT])

In [189]: s.to_json(date_format='iso')
Out[189]: '{"0":"0001-255-255T00:00:00"}'

@jreback (Contributor) commented Aug 14, 2013

gr8; since default was epoch anyhow, I think compat from 0.12 is pretty good

looks good to go....pls squash down (after passing) and will get it in

@Komnomnomnom (Contributor, Author)

Any idea why Travis would be choking on the compat test? It appears it might be a problem related to reading the external file; do the data files need to be whitelisted or anything?

Works for me locally on the same Python / numpy version.

@jreback (Contributor) commented Aug 14, 2013

In setup.py, add tests/data/*.json to the following list (I don't think the data files are being copied to the test dirs):

      package_data={'pandas.io': ['tests/data/legacy_hdf/*.h5',
                                  'tests/data/legacy_pickle/0.10.1/*.pickle',
                                  'tests/data/legacy_pickle/0.11.0/*.pickle',
                                  'tests/data/*.csv',
                                  'tests/data/*.dta',
                                  'tests/data/*.json',  # <- new entry for the JSON compat fixture
                                  'tests/data/*.txt',
                                  'tests/data/*.xls',
                                  'tests/data/*.xlsx',
                                  'tests/data/*.table',
                                  'tests/data/*.html'],
                    'pandas.tools': ['tests/*.csv'],
                    'pandas.tests': ['data/*.pickle',
                                     'data/*.csv'],
                    'pandas.tseries.tests': ['data/*.pickle',
                                             'data/*.csv']
                    },

@Komnomnomnom (Contributor, Author)

Doh, didn't think of that. Thanks!

@Komnomnomnom (Contributor, Author)

Squishified the commits and rebased to master. Good to go once Travis signs off, if you're happy with it.

@jreback (Contributor) commented Aug 15, 2013

looks good to me! thanks! bombs away....

jreback added a commit referencing this pull request on Aug 15, 2013: "JSON native support for datetime encoding"
@jreback merged commit 5958013 into pandas-dev:master on Aug 15, 2013
@jreback (Contributor) commented Aug 15, 2013

https://travis-ci.org/pydata/pandas/jobs/10223886

I had this fail once and then work locally....??

Closes #4362: BUG: Series/Dataframe to_json not returning miliseconds