BUG or DOC: pd.read_csv with parse_dates does not recognize timezone #22256

swyoon · 2018-08-09T08:45:12Z

When parsing timezone-aware datetimes in a CSV file with pd.read_csv + parse_dates, the result is naive timestamps converted to UTC, which surprised me.

Example

Suppose we are reading the following data from a file named pandas_read_csv_bug.csv.
It is a simple time series with the timezone (UTC+09:00) specified.

dt,val
2018-01-04 09:01:00+09:00,23350
2018-01-04 09:02:00+09:00,23400
2018-01-04 09:03:00+09:00,23400
2018-01-04 09:04:00+09:00,23400
2018-01-04 09:05:00+09:00,23400

I want to read it with pd.read_csv using the parse_dates keyword argument.
If it worked properly, this would be the most elegant solution.

import pandas as pd
df = pd.read_csv('pandas_read_csv_bug.csv', parse_dates=['dt'])

However, the result is a data frame df with strange timestamps.

	dt	val
0	2018-01-04 00:01:00	23350
1	2018-01-04 00:02:00	23400
2	2018-01-04 00:03:00	23400
3	2018-01-04 00:04:00	23400
4	2018-01-04 00:05:00	23400

Problem description

Two things surprised me:

The parsed datetimes are timezone-naive: df['dt'].iloc[0].tz is None evaluates to True.
The timestamps are automatically converted to UTC.

My first impression was that this is not the best possible behavior.
However, since a UTC offset does not uniquely correspond to a single timezone, this could be the safest and most reasonable behavior.
In that case, the documentation should mention it.
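To illustrate the point about offsets not identifying a timezone, here is a minimal sketch. The zone names Asia/Seoul and Asia/Tokyo are my own choice of example and are not taken from the CSV, which only carries the +09:00 offset.

```python
import pandas as pd

# Two different IANA zones (chosen for illustration only) that both
# sit at UTC+09:00 on the date used in the sample data.
seoul = pd.Timestamp("2018-01-04 09:01:00", tz="Asia/Seoul")
tokyo = pd.Timestamp("2018-01-04 09:01:00", tz="Asia/Tokyo")

# The offsets match, but the zones themselves are distinct.
same_offset = seoul.utcoffset() == tokyo.utcoffset()
same_zone = str(seoul.tz) == str(tokyo.tz)
print(same_offset, same_zone)
```

So a parser that only sees "+09:00" can at best keep a fixed offset; it has no way to pick a named zone.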

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-29-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.3.1
pip: 9.0.3
setuptools: 38.5.1
Cython: None
numpy: 1.15.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


gfyoung · 2018-08-09T20:09:11Z

I agree with your first impression. Patch and PR are welcome!

jbrockmendel · 2018-08-09T20:21:55Z

Can you try this on master? @mroeschke had a PR recently that I think should have fixed this.

gfyoung · 2018-08-10T07:56:49Z

@jbrockmendel : Unfortunately, no luck. I can reproduce this on master (475e391).

mroeschke · 2018-08-10T17:07:43Z

On master, if date_parser=pd.to_datetime is specifically passed, then the dates are not automatically converted to UTC:

In [5]: from io import StringIO

In [6]: s = StringIO("""dt,val
   ...: 2018-01-04 09:01:00+09:00,23350
   ...: 2018-01-04 09:02:00+09:00,23400
   ...: 2018-01-04 09:03:00+09:00,23400
   ...: 2018-01-04 09:04:00+09:00,23400
   ...: 2018-01-04 09:05:00+09:00,23400""")

In [7]: df = pd.read_csv(s, parse_dates=['dt'], date_parser=pd.to_datetime)

In [8]: df
Out[8]:
                         dt    val
0 2018-01-04 09:01:00+09:00  23350
1 2018-01-04 09:02:00+09:00  23400
2 2018-01-04 09:03:00+09:00  23400
3 2018-01-04 09:04:00+09:00  23400
4 2018-01-04 09:05:00+09:00  23400

In [10]: df['dt']
Out[10]:
0   2018-01-04 09:01:00+09:00
1   2018-01-04 09:02:00+09:00
2   2018-01-04 09:03:00+09:00
3   2018-01-04 09:04:00+09:00
4   2018-01-04 09:05:00+09:00
Name: dt, dtype: datetime64[ns, pytz.FixedOffset(540)]

I think we use to_datetime under the hood of read_csv to parse dates, so there may be a keyword argument or extra function that is coercing to UTC without timezone.
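For anyone who wants to sidestep the read_csv date path entirely, a rough sketch is to read the column as strings and call pd.to_datetime afterwards. This assumes a reasonably recent pandas where to_datetime keeps a uniform offset; the variable names are only for illustration.

```python
from io import StringIO

import pandas as pd

csv_text = """dt,val
2018-01-04 09:01:00+09:00,23350
2018-01-04 09:02:00+09:00,23400
"""

# Read the date column as plain strings, then convert it explicitly
# so that read_csv's own date handling is not involved.
df = pd.read_csv(StringIO(csv_text), dtype={"dt": str})
df["dt"] = pd.to_datetime(df["dt"])

first = df["dt"].iloc[0]
print(first.tzinfo is not None)  # whether the offset survived parsing
print(first.utcoffset())         # the offset attached to the value
```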

On that note, what should be the expected behavior?

In [7]: df = pd.read_csv(s, parse_dates=['dt'], date_parser=pd.to_datetime)

# Keep the offsets (as FixedOffset)
In [8]: df
Out[8]:
                         dt    val
0 2018-01-04 09:01:00+09:00  23350
1 2018-01-04 09:02:00+09:00  23400
2 2018-01-04 09:03:00+09:00  23400
3 2018-01-04 09:04:00+09:00  23400
4 2018-01-04 09:05:00+09:00  23400

In [14]: df = pd.read_csv(s, parse_dates=['dt'], date_parser=partial(pd.to_datetime, utc=True))

# More in line with the previous behavior but correctly keeps the UTC timezone
In [15]: df
Out[15]:
                         dt    val
0 2018-01-04 00:01:00+00:00  23350
1 2018-01-04 00:02:00+00:00  23400
2 2018-01-04 00:03:00+00:00  23400
3 2018-01-04 00:04:00+00:00  23400
4 2018-01-04 00:05:00+00:00  23400
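The two candidate behaviors can also be compared without read_csv by calling pd.to_datetime directly. This is only a sketch assuming a recent pandas; it shows that both options describe the same instants and differ only in the timezone attached.

```python
import pandas as pd

raw = pd.Series(["2018-01-04 09:01:00+09:00", "2018-01-04 09:02:00+09:00"])

kept = pd.to_datetime(raw)              # option 1: keep the +09:00 offset
as_utc = pd.to_datetime(raw, utc=True)  # option 2: convert to tz-aware UTC

# Both series point at the same instants; only the wall-clock
# representation (09:01 at +09:00 versus 00:01 at UTC) differs.
same_instants = (kept.dt.tz_convert("UTC") == as_utc).all()
print(same_instants)
```

Either way the result stays timezone-aware, unlike the naive UTC values reported in the original post.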

mroeschke · 2018-08-11T03:43:27Z

This change fixes this specific issue, but I'm not sure how it will affect other tests.

diff --git a/pandas/io/parsers.py b/pandas/io/parsers.py
index 4b3fa08e5..86f3b6812 100755
--- a/pandas/io/parsers.py
+++ b/pandas/io/parsers.py
@@ -3033,7 +3033,7 @@ def _make_date_converter(date_parser=None, dayfirst=False,
                 return tools.to_datetime(
                     ensure_object(strs),
                     utc=None,
-                    box=False,
+                    box=True,
                     dayfirst=dayfirst,
                     errors='ignore',
                     infer_datetime_format=infer_datetime_format

swyoon · 2018-08-12T10:31:13Z

@mroeschke Thanks for pointing out box. Unfortunately, this change breaks quite a lot of unit tests.
My first attempt was to keep box=False and update pandas/core/tools/datetime.py:_convert_listlike_datetimes.
However, I realized that we can't fix this issue with box=False, because what is returned is a NumPy array of datetime64, which cannot hold timezone information.
https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.datetime.html#changes-with-numpy-1-11
So, I will try to fix the errors caused by setting box=True.
@gfyoung There is a pandas sprint at PyCon KR in Seoul, Republic of Korea, on Aug 15th, and I am participating (it is organized by @scari).
I will continue to work on this issue at the sprint.
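The limitation of plain datetime64 arrays mentioned above can be seen in a small sketch. It assumes a recent pandas with Series.to_numpy, and Asia/Seoul is a hypothetical target zone that the data itself does not name.

```python
import numpy as np
import pandas as pd

aware = pd.to_datetime(pd.Series(["2018-01-04 09:01:00+09:00"]))

# A plain NumPy datetime64 array has no slot for a timezone:
# the values are converted to UTC and the offset is dropped.
as_numpy = aware.to_numpy(dtype="datetime64[ns]")
print(as_numpy[0])

# Wrapping the array again gives naive timestamps. The offset can only be
# restored if the intended zone is known from somewhere else (assumed here).
naive = pd.Series(as_numpy)
restored = naive.dt.tz_localize("UTC").dt.tz_convert("Asia/Seoul")
print(naive.dt.tz is None, restored.iloc[0])
```

This matches the symptom in the report: 09:01 at +09:00 comes back as a naive 00:01.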

scari · 2018-08-13T04:18:51Z

@swyoon I look forward to seeing you! ;)


gfyoung added the Datetime (Datetime data dtype) and IO CSV (read_csv, to_csv) labels Aug 9, 2018

gfyoung added the Bug label Aug 9, 2018

swyoon mentioned this issue Aug 16, 2018

BUG: fix read_csv to parse timezone correctly #22380

Merged

4 tasks

jreback added this to the Contributions Welcome milestone Aug 16, 2018

swyoon added a commit to swyoon/pandas that referenced this issue Aug 20, 2018

BUG: fix read_csv to parse timezone correctly

62a7ef7

- use box=True for to_datetime(), and adjust downstream processing to the change. - resolve pandas-dev#22256

jreback modified the milestones: Contributions Welcome, 0.24.0 Aug 20, 2018

swyoon added a commit to swyoon/pandas that referenced this issue Aug 20, 2018

BUG: fix read_csv to parse timezone correctly

323dddc

- use box=True for to_datetime(), and adjust downstream processing to the change. - resolve pandas-dev#22256

swyoon added a commit to swyoon/pandas that referenced this issue Aug 20, 2018

BUG: fix read_csv to parse timezone correctly

3a9c093

- use box=True for to_datetime(), and adjust downstream processing to the change. - resolve pandas-dev#22256

jreback closed this as completed in #22380 Aug 22, 2018

mroeschke mentioned this issue Jan 28, 2019

0.23.4 changed read_csv parsing for a mixed-timezone datetimes #24987

Closed
