ENH: read_csv parse_dates should use datetime64[us] instead of datetime64[ns] if date out of bound is detected. #31711

@spixi

Description

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

df = pd.read_csv("./Artikelstatus_Augsburg.csv", sep=';', parse_dates=['VON_DTM','BIS_DTM'], infer_datetime_format=True)
print(df.info())
print(df.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14472 entries, 0 to 14471
Data columns (total 4 columns):
Artikel     14472 non-null int64
ATTRIBUT    14472 non-null int64
VON_DTM     14472 non-null datetime64[ns]
BIS_DTM     14472 non-null object
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 452.3+ KB
None
   Artikel  ATTRIBUT             VON_DTM              BIS_DTM
0       17         0 2018-10-01 00:00:00  2018-11-21 16:25:18
1       17         3 2018-11-21 16:25:19  2999-12-31 23:59:59
2       35         0 2018-10-01 00:00:00  2018-11-30 11:38:21
3       35         2 2018-11-30 11:38:22  2018-12-04 17:09:05
4       35         0 2018-12-04 17:09:06  2018-12-05 11:09:24

Problem description

read_csv with the parse_dates parameter uses datetime64[ns] by default. That dtype, however, only covers dates from 1677-09-21 to 2262-04-11. Many systems use hardcoded sentinel dates to represent -Inf or Inf, including 0001-01-01 00:00:00, 2999-12-31 23:59:59 or 9999-12-31 23:59:59.

Currently, read_csv returns an object column instead of a datetime64[us] column when such a date is present. Using the date_parser argument leads to a huge performance drop, and even workarounds like na_values = ['2999-12-31 23:59:59'] do not work.
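One possible workaround, sketched under assumptions: read the affected column as plain strings and let NumPy parse it at microsecond resolution. The inline CSV is a stand-in whose layout is assumed from the printed output above, since the original file is not part of this report. The result is kept as a NumPy array because pandas versions limited to nanosecond resolution cannot store it in the DataFrame.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the CSV in the report; the layout is an assumption.
CSV = """Artikel;ATTRIBUT;VON_DTM;BIS_DTM
17;0;2018-10-01 00:00:00;2018-11-21 16:25:18
17;3;2018-11-21 16:25:19;2999-12-31 23:59:59
35;0;2018-10-01 00:00:00;2018-11-30 11:38:21
"""

# Parse VON_DTM as usual, but keep BIS_DTM as strings.
df = pd.read_csv(
    io.StringIO(CSV),
    sep=";",
    parse_dates=["VON_DTM"],
    dtype={"BIS_DTM": str},
)

# NumPy parses the ISO-like strings itself, so no per-row Python
# date_parser callback is involved.
bis_dtm = np.array(df["BIS_DTM"].tolist(), dtype="datetime64[us]")
print(bis_dtm.dtype, bis_dtm.max())
```

This only works for ISO-like timestamp strings that NumPy can parse directly; other formats would need a different conversion step.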

Instead of an object column, a datetime64[us] column should be returned when an out-of-bounds date is found. Possible approaches to solve the issue include autodetection by csv.Sniffer, a separate datetime_unit parameter, or backtracking.

Expected Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14472 entries, 0 to 14471
Data columns (total 4 columns):
Artikel     14472 non-null int64
ATTRIBUT    14472 non-null int64
VON_DTM     14472 non-null datetime64[ns]
BIS_DTM     14472 non-null datetime64[us]
dtypes: datetime64[ns](1), datetime64[us](1), int64(2)
memory usage: 452.3+ KB
None
   Artikel  ATTRIBUT             VON_DTM              BIS_DTM
0       17         0 2018-10-01 00:00:00  2018-11-21 16:25:18
1       17         3 2018-11-21 16:25:19  2999-12-31 23:59:59
2       35         0 2018-10-01 00:00:00  2018-11-30 11:38:21
3       35         2 2018-11-30 11:38:22  2018-12-04 17:09:05
4       35         0 2018-12-04 17:09:06  2018-12-05 11:09:24

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-1052-gcp
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: None
pip: 19.2.2
setuptools: 41.0.1
Cython: 0.29.14
numpy: 1.16.4
scipy: 1.3.1
pyarrow: 0.15.1
xarray: None
IPython: 7.8.0
sphinx: 2.2.0
patsy: 0.5.0
dateutil: 2.8.0
pytz: 2019.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.1.1
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.3.10
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
