read_csv() doesn't parse correctly when `usecols` and `parse_dates` are both used #14792

rubennj · 2016-12-03T17:19:21Z

Code Sample, a copy-pastable example if possible

In [22]: s = """a,b,c,d,e,f,g,h,i,j
    ...: 2016/09/21,1,1,2,3,4,5,6,7,8"""

In [23]: pd.read_csv(StringIO(s), parse_dates=[0], usecols=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 10 columns):
a    1 non-null datetime64[ns]
b    1 non-null int64
c    1 non-null int64
d    1 non-null int64
e    1 non-null int64
f    1 non-null int64
g    1 non-null int64
h    1 non-null int64
i    1 non-null object    <- !!
j    1 non-null int64
dtypes: datetime64[ns](1), int64(8), object(1)
memory usage: 160.0+ bytes

In [24]: pd.read_csv(StringIO(s), parse_dates=[[0, 1]], usecols=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 9 columns):
a_b    1 non-null object
c      1 non-null int64
d      1 non-null int64
e      1 non-null int64
f      1 non-null int64
g      1 non-null int64
h      1 non-null object    <- !!
i      1 non-null object    <- !!
j      1 non-null int64
dtypes: int64(6), object(3)
memory usage: 152.0+ bytes

Problem description

Since v0.18.1, `pd.read_csv()` parses some columns incorrectly, and which columns are affected changes at random from run to run. This only happens when `usecols` and `parse_dates` are both used.

Expected Output

All of the non-date columns should be parsed as int64, with none of them randomly ending up as object.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 29.0.1.post20161201
Cython: None
numpy: 1.11.2
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None


jorisvandenbossche · 2016-12-06T11:51:19Z

@rubennj Thanks for the report! (I simplified the example a little bit)

cc @gfyoung

gfyoung · 2016-12-08T04:45:16Z

@rubennj : Really weird bug. We have tests for parse_dates and usecols here. However, those tests evidently use fewer columns than your example does. A quick workaround seems to be passing in engine='python', though why that should make a difference bewilders me.
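For anyone who wants to try that workaround, here is a minimal sketch built on the sample data from the report. The helper name `read_with_python_engine` and the dtype check are illustrative assumptions, not part of pandas or of the original example; the only change to the `read_csv` call itself is `engine="python"`.

```python
from io import StringIO

import pandas as pd

# Sample data from the original report.
CSV_TEXT = "a,b,c,d,e,f,g,h,i,j\n2016/09/21,1,1,2,3,4,5,6,7,8"
COLUMNS = list("abcdefghij")


def read_with_python_engine(text):
    """Hypothetical helper: same call as in the report, but with the Python engine."""
    return pd.read_csv(
        StringIO(text),
        parse_dates=[0],
        usecols=COLUMNS,
        engine="python",
    )


df = read_with_python_engine(CSV_TEXT)

# Collect any non-date column that did not come back as int64.
unexpected = {
    name: str(dtype)
    for name, dtype in df.dtypes.items()
    if name != "a" and str(dtype) != "int64"
}
print(unexpected)
```

If the workaround holds, `unexpected` should come back empty on every run.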

gfyoung · 2016-12-25T09:04:33Z

Finally got some time to look at this, and your statement about it happening at random was the key. Unfortunately, the behavior is flaky on the C engine side. When usecols is also passed in, the indices in parse_dates are resolved with respect to usecols in order to determine which columns should not be converted because they are being used for datetime conversions. The way we do that lookup is unstable.

First, we initialize self.usecols to be a set, which you can see here. When we proceed to index into usecols for parse_dates, we first convert it to a list, as seen here. That is the flaky part: if you run this command in the terminal several times:

python -c "print(list(set(list('abcdefghij'))))"

you will see that the order of the resulting list differs from run to run.
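The run-to-run difference comes from string hash randomization, which Python 3 applies per process by default and which the `PYTHONHASHSEED` environment variable controls. The sketch below is only an illustration of that effect: it runs the same one-liner in fresh interpreters with several fixed seeds, and the names `SNIPPET` and `ordering_for_seed` are made up for this example.

```python
import os
import subprocess
import sys

# The same expression as the one-liner above, joined into a string for readability.
SNIPPET = "print(''.join(list(set('abcdefghij'))))"


def ordering_for_seed(seed):
    """Run SNIPPET in a fresh interpreter with a fixed string-hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Each seed stands in for one "run" of the interpreter.
orderings = {seed: ordering_for_seed(seed) for seed in range(1, 9)}
for seed, order in orderings.items():
    print(seed, order)
```

Each line holds the same ten letters, but the order generally changes with the seed, so any code that indexes into `list(some_set)` by position can pick a different element on each run.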

The Python engine does not see this issue because it prunes columns early on and iterates over the column names, which are stored in a list. Hence, it is robust against this flaky set behavior.
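To make the difference concrete, here is a rough sketch of the two lookup strategies. It is not the pandas source and not the actual patch; `resolve_unstable` and `resolve_stable` are hypothetical names, and the code only assumes that positional indices have to be mapped onto the selected column names.

```python
def resolve_unstable(usecols, positions):
    """Index into list(set(...)): the result depends on set iteration order."""
    as_list = list(set(usecols))
    return [as_list[i] for i in positions]


def resolve_stable(header, usecols, positions):
    """Filter the header (a list, in file order) by membership in usecols."""
    wanted = set(usecols)
    kept = [name for name in header if name in wanted]
    return [kept[i] for i in positions]


header = list("abcdefghij")

# Stable: position 0 is always the first selected column in file order.
print(resolve_stable(header, header, [0]))

# Unstable: position 0 can be any of the ten names, depending on the run.
print(resolve_unstable(header, [0]))
```

The set is still useful for fast membership tests; the point is only that positions should be taken from an ordered sequence rather than from the set itself.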

Closes pandas-devgh-14792.

sinhrks added Bug IO CSV read_csv, to_csv labels Dec 4, 2016

jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Dec 6, 2016

gfyoung added a commit to forking-repos/pandas that referenced this issue Dec 25, 2016

BUG: Avoid flaky usecols set in C engine

30bae02

Closes pandas-devgh-14792.

gfyoung mentioned this issue Dec 25, 2016

BUG: Avoid flaky usecols set in C engine #14984

Closed

jreback added this to the 0.20.0 milestone Dec 26, 2016

gfyoung added a commit to forking-repos/pandas that referenced this issue Dec 27, 2016

BUG: Avoid flaky usecols set in C engine

6ac5814

Closes pandas-devgh-14792.

gfyoung added a commit to forking-repos/pandas that referenced this issue Dec 27, 2016

BUG: Avoid flaky usecols set in C engine

82cf55b

Closes pandas-devgh-14792.

jreback closed this as completed in a42a015 Dec 30, 2016

quazgar mentioned this issue Apr 21, 2020

BUG: read_csv with both names and parse_date raises 'NoneType' TypeError #33699

Closed
