Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Support writing timestamps with timezones with to_sql #22654

Merged
merged 36 commits into from
Nov 8, 2018

Conversation

mroeschke
Copy link
Member

@mroeschke mroeschke commented Sep 10, 2018

@pep8speaks
Copy link

Hello @mroeschke! Thanks for submitting the PR.

@codecov
Copy link

codecov bot commented Sep 10, 2018

Codecov Report

Merging #22654 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #22654   +/-   ##
=======================================
  Coverage   92.17%   92.17%           
=======================================
  Files         169      169           
  Lines       50708    50708           
=======================================
  Hits        46740    46740           
  Misses       3968     3968
Flag Coverage Δ
#multiple 90.58% <ø> (ø) ⬆️
#single 42.37% <ø> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 73dd6ec...c7c4a7a. Read the comment docs.

@mroeschke mroeschke added Enhancement IO SQL to_sql, read_sql, read_sql_query Timezones Timezone data dtype labels Sep 11, 2018
@mroeschke mroeschke changed the title WIP: Support writing timestamps with timezones with to_sql ENH: Support writing timestamps with timezones with to_sql Sep 12, 2018
@mroeschke mroeschke added this to the 0.24.0 milestone Sep 12, 2018
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! this was long overdue :-)

# GH 9086
if self.flavor != 'postgresql':
msg = "{} does not support datetime with time zone"
pytest.skip(msg.format(self.flavor))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we test assert the behaviour for a database that does not support it? Eg what happens if you write a column with timezone aware data to mysql?
From the sqlalchemy docs, it seems the flag timezone=True is simply ignored, but how are then the datetime values we send it stored? Does that change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the values send change:

In [67]: idx = pd.date_range("2012-01-01 09:00", periods=3, tz='Europe/Brussels')

In [68]: idx.values.astype('M8[us]').astype(object)
Out[68]: 
array([datetime.datetime(2012, 1, 1, 8, 0),
       datetime.datetime(2012, 1, 2, 8, 0),
       datetime.datetime(2012, 1, 3, 8, 0)], dtype=object)

In [69]: idx.to_pydatetime()
Out[69]: 
array([ datetime.datetime(2012, 1, 1, 9, 0, tzinfo=<DstTzInfo 'Europe/Brussels' CET+1:00:00 STD>),
       datetime.datetime(2012, 1, 2, 9, 0, tzinfo=<DstTzInfo 'Europe/Brussels' CET+1:00:00 STD>),
       datetime.datetime(2012, 1, 3, 9, 0, tzinfo=<DstTzInfo 'Europe/Brussels' CET+1:00:00 STD>)], dtype=object)

but not sure if that changes how the values are then stored in the mysql database (since the utc / naive time is still the same), but so this might be worth testing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. From earlier tests databases that don't support timestamp with timezone read back data as naive (don't know if this is local naive or converted to UTC first and then returned naive though). Will add some tests.

def test_date_parsing(self):
# No Parsing
df = sql.read_sql_table("types_test_data", self.conn)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we rather test that it is not parsed instead of removing it? (but I agree this currently looks like this is not doing much)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. My cursory glance thought it was duplicate of the line below; you're right, I will try to add an assert to this result as well

df.to_sql('test_datetime_tz', self.conn)

expected = df.copy()
expected['A'] = expected['A'].dt.tz_convert('UTC')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that I remember correctly: when reading, and we get datetime objects with an offset (if there is a timestamp with timezone), we always convert to utc because we cannot really know what is the timezone? (we get a simple 'fixed offset' ?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct, so we should expect this round trip test to load US/Pacific and return UTC

@mroeschke
Copy link
Member Author

@jorisvandenbossche so looks like based on the passing tests, databases without timezone support store those the data as naive local time.

@codecov
Copy link

codecov bot commented Sep 19, 2018

Codecov Report

Merging #22654 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #22654   +/-   ##
=======================================
  Coverage   92.24%   92.24%           
=======================================
  Files         161      161           
  Lines       51224    51224           
=======================================
  Hits        47254    47254           
  Misses       3970     3970
Flag Coverage Δ
#multiple 90.63% <ø> (ø) ⬆️
#single 42.3% <ø> (+0.02%) ⬆️
Impacted Files Coverage Δ
pandas/core/generic.py 96.81% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 28a42da...ef3b20f. Read the comment docs.

@@ -179,6 +179,7 @@ Other Enhancements
- :class:`IntervalIndex` has gained the :meth:`~IntervalIndex.set_closed` method to change the existing ``closed`` value (:issue:`21670`)
- :func:`~DataFrame.to_csv`, :func:`~Series.to_csv`, :func:`~DataFrame.to_json`, and :func:`~Series.to_json` now support ``compression='infer'`` to infer compression based on filename extension (:issue:`15008`).
The default compression for ``to_csv``, ``to_json``, and ``to_pickle`` methods has been updated to ``'infer'`` (:issue:`22004`).
- :func:`to_sql` now supports writing ``TIMESTAMP WITH TIME ZONE`` columns (:issue:`9086`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be

:meth:`DataFrame.to_sql`

@mroeschke
Copy link
Member Author

So the only significant consequence is described here in the Datetimelike API Changes:

- :meth:`DataFrame.to_sql` now writes timezone aware datetime data (``datetime64[ns, tz]`` dtype) as timezone unaware local timestamps instead of timezone unaware UTC timestamps for database dialects that don't support the ``TIMESTAMP WITH TIME ZONE`` type. See the :ref:`io.sql_datetime_data` for implications (:issue:`9086`).

Roundtrips Before:

  • tz-aware DataFrame --> database that doesn't support timezones --> tz-naive UTC DataFrame
  • tz-aware DataFrame --> database that does support timezones --> tz-naive UTC DataFrame

Roundtrips After:

  • tz-aware DataFrame --> database that doesn't support timezones --> tz-naive local DataFrame
  • tz-aware DataFrame --> database that does support timezones --> tz-aware UTC DataFrame

So addressing your last comment's bullet points:

  • No error is raised. The data will now be written as naive local instead of naive UTC
  • Correct. Before, pandas read tz aware data from databases as UTC and that behavior has not changed.

@mroeschke
Copy link
Member Author

Yeah given this discussion; it would be wise to highlight this change as its own section in the whatsnew.

@mroeschke
Copy link
Member Author

@jorisvandenbossche added a new whatsnew section in API detailing the before and after implications of this change.

@jorisvandenbossche
Copy link
Member

So yesterday I tried to make a small illustrative example showing the changed round-trip, but did it actually work before?
Because I get:

In [1]: import sqlalchemy

In [3]: engine = sqlalchemy.create_engine('sqlite:///:memory:')

In [4]: df = pd.DataFrame({'a':[1,2,3], 'b': pd.date_range('2016-01-01', periods=3, tz='Europe/Brussels')})

In [5]: df.to_sql('table', engine)
...
TypeError: Cannot cast DatetimeIndex to dtype datetime64[us]

In [6]: pd.__version__
Out[6]: '0.23.4'

So that is my source of confusion. I thought it worked before (and you seemed to confirm it), so therefore I assumed that there is a change in behaviour. But is that actually the case?
Do you have a running example of the cases that you list in the table above?

@mroeschke
Copy link
Member Author

Hmm interesting. I made a (bad) assumption that the tests covered this case, but I actually don't see a test case where to_sql is tested with timezones. So the behavior you're seeing is probably correct.

I'll try to confirm tonight on other dbs as well, but I guess it looks like I fixed a bug too :)

@mroeschke
Copy link
Member Author

I ran the same test on postgres @jorisvandenbossche and I got the same result.

Therefore you're correct in that there really isn't a change in behavior since previously to_sql and timezone aware datetimes raised an error. I reflected this in a separate whatsnew entry in the io bug section.

@mroeschke
Copy link
Member Author

Added a fix for #23510 since it was a pretty easy addition to this PR

@TomAugspurger
Copy link
Contributor

Small issue with the new test I think https://travis-ci.org/pandas-dev/pandas/jobs/451739169#L1986

_____ TestMySQLAlchemy.test_naive_datetimeindex_roundtrip[load_iris_data0] _____
self = <pandas.tests.io.test_sql.TestMySQLAlchemy object at 0x7febe71173d0>
    def test_naive_datetimeindex_roundtrip(self):
        # GH 23510
        # Ensure that a naive DatetimeIndex isn't converted to UTC
        dates = date_range('2018-01-01', periods=5, freq='6H')
        expected = DataFrame({'nums': range(5)}, index=dates)
        expected.to_sql('foo_table', self.conn, index_label='info_date')
        result = sql.read_sql_table('foo_table', self.conn
                                    index_col='info_date')
>       tm.assert_frame_equal(result, expected)
E       AssertionError: DataFrame.index are different
E       
E       Attribute "names" are different
E       [left]:  [u'info_date']
E       [right]: [None]

looks good otherwise.

@mroeschke
Copy link
Member Author

Thanks @TomAugspurger. Fixed that test (I don't think index name is expected to roundtrip in that test)

@@ -1275,6 +1246,9 @@ MultiIndex
I/O
^^^

- Bug in :meth:`to_sql` when writing timezone aware data (``datetime64[ns, tz]`` dtype) would raise a ``TypeError`` (:issue:`9086`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move this to the "enhancements" though, I would say that timezones were simply never supported, so it is a nice enhancement that we will now actually support it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, now looking at the full diff (and not only what changed recently), and see you actually already have that. It's a bit duplicated now, but I am fine with keeping it in both places.

if col.dt.tz is not None:
return TIMESTAMP(timezone=True)
except AttributeError:
# The column is actually a DatetimeIndex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just for my understanding: where was this attribute error catched before?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoop, sorry, again misunderstanding from not looking at the full diff :-)

@jorisvandenbossche
Copy link
Member

Lets's finally merge this! @mroeschke thanks for the endurance :-)

@jorisvandenbossche jorisvandenbossche merged commit d0ec813 into pandas-dev:master Nov 8, 2018
@mroeschke mroeschke deleted the writing_timezone_sql branch November 8, 2018 18:02
thoo added a commit to thoo/pandas that referenced this pull request Nov 10, 2018
…fixed

* upstream/master: (47 commits)
  CLN: remove values attribute from datetimelike EAs (pandas-dev#23603)
  DOC/CI: Add linting to rst files, and fix issues (pandas-dev#23381)
  PERF: Speeds up creation of Period, PeriodArray, with Offset freq (pandas-dev#23589)
  PERF: define is_all_dates to shortcut inadvertent copy when slicing an IntervalIndex (pandas-dev#23591)
  TST: Tests and Helpers for Datetime/Period Arrays (pandas-dev#23502)
  Update description of Index._values/values/ndarray_values (pandas-dev#23507)
  Fixes to make validate_docstrings.py not generate warnings or unwanted output (pandas-dev#23552)
  DOC: Added note about groupby excluding Decimal columns by default (pandas-dev#18953)
  ENH: Support writing timestamps with timezones with to_sql (pandas-dev#22654)
  CI: Auto-cancel redundant builds (pandas-dev#23523)
  Preserve EA dtype in DataFrame.stack (pandas-dev#23285)
  TST: Fix dtype mismatch on 32bit in IntervalTree get_indexer test (pandas-dev#23468)
  BUG: raise if invalid freq is passed (pandas-dev#23546)
  remove uses of (ts)?lib.(NaT|iNaT|Timestamp) (pandas-dev#23562)
  BUG: Fix error message for invalid HTML flavor (pandas-dev#23550)
  ENH: Support EAs in Series.unstack (pandas-dev#23284)
  DOC: Updating DataFrame.join docstring (pandas-dev#23471)
  TST: coverage for skipped tests in io/formats/test_to_html.py (pandas-dev#22888)
  BUG: Return KeyError for invalid string key (pandas-dev#23540)
  BUG: DatetimeIndex slicing with boolean Index raises TypeError (pandas-dev#22852)
  ...
JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this pull request Nov 14, 2018
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO SQL to_sql, read_sql, read_sql_query Timezones Timezone data dtype
Projects
None yet
5 participants