Description
Code Sample, a copy-pastable example if possible
import pandas as pd
import numpy as np
df = pd.read_csv("./Artikelstatus_Augsburg.csv", sep=';', parse_dates=['VON_DTM','BIS_DTM'], infer_datetime_format=True)
print(df.info())
print(df.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14472 entries, 0 to 14471
Data columns (total 4 columns):
Artikel     14472 non-null int64
ATTRIBUT    14472 non-null int64
VON_DTM     14472 non-null datetime64[ns]
BIS_DTM     14472 non-null object
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 452.3+ KB
None
   Artikel  ATTRIBUT              VON_DTM              BIS_DTM
0       17         0  2018-10-01 00:00:00  2018-11-21 16:25:18
1       17         3  2018-11-21 16:25:19  2999-12-31 23:59:59
2       35         0  2018-10-01 00:00:00  2018-11-30 11:38:21
3       35         2  2018-11-30 11:38:22  2018-12-04 17:09:05
4       35         0  2018-12-04 17:09:06  2018-12-05 11:09:24
Problem description
read_csv with the parse_dates parameter uses datetime64[ns] by default. That dtype, however, only covers dates between roughly the years 1677 and 2262. Many systems use hardcoded special dates for -Inf or Inf, including 0001-01-01 00:00:00, 2999-12-31 23:59:59 or 9999-12-31 23:59:59.
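To make the range limit concrete, here is a minimal sketch using only the standard library. It assumes that datetime64[ns] stores a signed 64-bit count of nanoseconds since 1970-01-01; the helper name fits_in_ns is made up for illustration and is not a pandas function.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
I64_MAX = 2**63 - 1  # largest signed 64-bit integer

# datetime only has microsecond precision, so the nanosecond count is
# truncated to whole microseconds before building the bounds.
NS_UPPER = EPOCH + timedelta(microseconds=I64_MAX // 1000)
NS_LOWER = EPOCH - timedelta(microseconds=I64_MAX // 1000)


def fits_in_ns(value: datetime) -> bool:
    """Hypothetical helper: True if value lies inside the int64-nanosecond range."""
    return NS_LOWER <= value <= NS_UPPER


print("ns range:", NS_LOWER, "to", NS_UPPER)
for sentinel in (
    datetime(1, 1, 1, 0, 0, 0),
    datetime(2999, 12, 31, 23, 59, 59),
    datetime(9999, 12, 31, 23, 59, 59),
):
    print(sentinel, "fits in ns:", fits_in_ns(sentinel))

# With microsecond resolution the same 64 bits span far more time.
US_SPAN_YEARS = I64_MAX / 1_000_000 / 86_400 / 365.2425
print("us range: about +/-", round(US_SPAN_YEARS), "years around 1970")
```

All three sentinel dates fall outside the nanosecond range, while a microsecond unit would cover them with a wide margin.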
The current behavior of read_csv is to return an object column instead of a datetime64[us] column. Using the date_parser argument leads to a huge performance drop, and even workarounds like na_values = ['2999-12-31 23:59:59'] do not work.
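One possible workaround, sketched below, is to read the affected column as plain strings, blank out the sentinel by hand, and only then convert. This is not the behavior requested in this issue: it loses the sentinel value and replaces it with NaT. The inline CSV is a two-row excerpt of the data shown above, and the SENTINEL constant is an assumption about which special date the file uses.

```python
import io

import pandas as pd

SENTINEL = "2999-12-31 23:59:59"  # assumed "open end" marker in this file

csv_text = (
    "Artikel;ATTRIBUT;VON_DTM;BIS_DTM\n"
    "17;0;2018-10-01 00:00:00;2018-11-21 16:25:18\n"
    "17;3;2018-11-21 16:25:19;2999-12-31 23:59:59\n"
)

# Parse only the column that is known to stay in range; keep BIS_DTM as text.
df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",
    parse_dates=["VON_DTM"],
    dtype={"BIS_DTM": str},
)

# Remember which rows were open-ended, then mask the sentinel before converting.
is_open_ended = df["BIS_DTM"] == SENTINEL
df["BIS_DTM"] = pd.to_datetime(df["BIS_DTM"].mask(is_open_ended))

print(df.dtypes)
print(df)
```

The boolean is_open_ended series keeps the information that the sentinel carried, so it can be stored as an extra column if downstream code needs it.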
Instead of an object column, a datetime64[us] column should be returned when an out-of-bounds date is found. Possible approaches to solve the issue include autodetection by csv.Sniffer, a separate parameter datetime_unit, or backtracking.
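The backtracking idea could look roughly like the following NumPy-only sketch. It is an illustration of the proposal, not pandas code: parse_with_fallback is a hypothetical name, the input is assumed to be ISO-like strings with at most second precision (as in the sample file), and missing values are not handled.

```python
import numpy as np

I64_MAX = 2**63 - 1

# Nanosecond bounds expressed in whole seconds, rounded toward the epoch
# so that every value passing the check is safe to convert to ns.
NS_MAX_S = np.datetime64(I64_MAX // 10**9, "s")
NS_MIN_S = np.datetime64(-(I64_MAX // 10**9), "s")


def parse_with_fallback(values):
    """Hypothetical helper: return datetime64[ns] if every value fits, else datetime64[us]."""
    # Normalise "YYYY-MM-DD HH:MM:SS" to the ISO "T" separator.
    iso = [v.replace(" ", "T") for v in values]
    # Second resolution has a huge range, so this first pass cannot overflow
    # for any four-digit year.
    seconds = np.array(iso, dtype="datetime64[s]")
    in_ns_range = (seconds >= NS_MIN_S) & (seconds <= NS_MAX_S)
    unit = "ns" if in_ns_range.all() else "us"
    return seconds.astype(f"datetime64[{unit}]")


in_range = parse_with_fallback(["2018-10-01 00:00:00", "2018-11-21 16:25:18"])
mixed = parse_with_fallback(["2018-11-21 16:25:18", "2999-12-31 23:59:59"])
print(in_range.dtype, mixed.dtype)
```

A real implementation inside the parser would need to avoid the second pass over the data; the sketch only shows the unit decision.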
Expected Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14472 entries, 0 to 14471
Data columns (total 4 columns):
Artikel     14472 non-null int64
ATTRIBUT    14472 non-null int64
VON_DTM     14472 non-null datetime64[ns]
BIS_DTM     14472 non-null datetime64[us]
dtypes: datetime64[ns](1), datetime64[us](1), int64(2)
memory usage: 452.3+ KB
None
   Artikel  ATTRIBUT              VON_DTM              BIS_DTM
0       17         0  2018-10-01 00:00:00  2018-11-21 16:25:18
1       17         3  2018-11-21 16:25:19  2999-12-31 23:59:59
2       35         0  2018-10-01 00:00:00  2018-11-30 11:38:21
3       35         2  2018-11-30 11:38:22  2018-12-04 17:09:05
4       35         0  2018-12-04 17:09:06  2018-12-05 11:09:24
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-1052-gcp
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.1
pytest: None
pip: 19.2.2
setuptools: 41.0.1
Cython: 0.29.14
numpy: 1.16.4
scipy: 1.3.1
pyarrow: 0.15.1
xarray: None
IPython: 7.8.0
sphinx: 2.2.0
patsy: 0.5.0
dateutil: 2.8.0
pytz: 2019.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.1.1
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.3.10
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None