read_csv() doesn't parse correctly when usecols and parse_dates are both used #14792

Closed
rubennj opened this issue Dec 3, 2016 · 3 comments
Labels: Bug; IO CSV (read_csv, to_csv); Regression (functionality that used to work in a prior pandas version)

@rubennj

rubennj commented Dec 3, 2016

Code Sample, a copy-pastable example if possible

In [22]: s = """a,b,c,d,e,f,g,h,i,j
    ...: 2016/09/21,1,1,2,3,4,5,6,7,8"""

In [23]: pd.read_csv(StringIO(s), parse_dates=[0], usecols=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 10 columns):
a    1 non-null datetime64[ns]
b    1 non-null int64
c    1 non-null int64
d    1 non-null int64
e    1 non-null int64
f    1 non-null int64
g    1 non-null int64
h    1 non-null int64
i    1 non-null object    <- !!
j    1 non-null int64
dtypes: datetime64[ns](1), int64(8), object(1)
memory usage: 160.0+ bytes

In [24]: pd.read_csv(StringIO(s), parse_dates=[[0, 1]], usecols=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 9 columns):
a_b    1 non-null object
c      1 non-null int64
d      1 non-null int64
e      1 non-null int64
f      1 non-null int64
g      1 non-null int64
h      1 non-null object    <- !!
i      1 non-null object    <- !!
j      1 non-null int64
dtypes: int64(6), object(3)
memory usage: 152.0+ bytes

Problem description

Since v0.18.1, pd.read_csv() infers the wrong dtype for some columns, and the affected columns change randomly from run to run. It occurs only when usecols and parse_dates are both used.

Expected Output

All the non-date columns parsed as int64, and none of them randomly as object.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 29.0.1.post20161201
Cython: None
numpy: 1.11.2
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

sinhrks added the Bug and IO CSV (read_csv, to_csv) labels Dec 4, 2016
jorisvandenbossche added the Regression (functionality that used to work in a prior pandas version) label Dec 6, 2016
@jorisvandenbossche
Member

@rubennj Thanks for the report! (I simplified the example a little bit)

cc @gfyoung

@gfyoung
Member

gfyoung commented Dec 8, 2016

@rubennj : Really weird bug. We have tests for parse_dates and usecols here. However, they evidently cover fewer columns than your example does. A quick workaround seems to be passing in engine='python', though why that should make a difference bewilders me.
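
For reference, the engine='python' workaround applied to the example from the report might look like the sketch below. The variable name `non_date` is only illustrative, and the output depends on the installed pandas version, so no expected output is shown.

```python
from io import StringIO

import pandas as pd

# Same data as in the original report.
s = """a,b,c,d,e,f,g,h,i,j
2016/09/21,1,1,2,3,4,5,6,7,8"""

# engine="python" selects the pure-Python parser instead of the C parser.
df = pd.read_csv(
    StringIO(s),
    parse_dates=[0],
    usecols=list("abcdefghij"),
    engine="python",
)

# Every column except the date column "a" should come back as an integer dtype.
non_date = df.dtypes.drop("a")
print(non_date)
```

The Python engine is generally slower than the C engine, so this is a stopgap rather than a fix.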

@gfyoung
Member

gfyoung commented Dec 25, 2016

Finally got some time to look at this, and your statement about it happening at random was the key. We unfortunately have flaky behavior on the C engine side. When usecols is also passed in, we work out which columns to skip converting (because they're being used for datetime conversion) by interpreting the indices in parse_dates relative to usecols, but the way we do that is unstable.

First, we initialize self.usecols as a set, which you can see here. When we then index into usecols for parse_dates, we first convert it to a list, as seen here. That is the flaky part: if you run this command in the terminal several times,

python -c "print(list(set(list('abcdefghij'))))"

you will get a different ordering on different runs, because Python 3 randomizes string hashes per process by default.
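
A small sketch of the same experiment, run from a single script: it launches the one-liner in fresh interpreters with different PYTHONHASHSEED values, which is the environment variable that controls string hash randomization. The helper name `set_order_for_seed` and the chosen seed range are only illustrative.

```python
import os
import subprocess
import sys

# The one-liner from the comment above.
SNIPPET = "print(list(set(list('abcdefghij'))))"


def set_order_for_seed(seed):
    """Run SNIPPET in a fresh interpreter with a fixed PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


# A fixed seed is reproducible; different seeds typically give different orders.
orderings = {seed: set_order_for_seed(seed) for seed in range(1, 6)}
for seed, order in orderings.items():
    print(seed, order)
```

Pinning PYTHONHASHSEED like this can also help reproduce one particular failing order while debugging.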

The Python engine does not see this issue because it prunes columns early on and iterates over the column names, which are a list. Hence, it is robust against this flaky set behavior.
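
To make the difference concrete, here is a minimal sketch of the two lookup strategies. It is not the actual pandas code: the functions `resolve_via_set` and `resolve_via_header` are hypothetical and only illustrate indexing into a list built from a set versus indexing into a list that keeps the file's column order.

```python
def resolve_via_set(usecols, parse_dates):
    """Unstable: positions are looked up in list(set(...)), whose order is arbitrary."""
    as_list = list(set(usecols))
    return [as_list[i] for i in parse_dates]


def resolve_via_header(header, usecols, parse_dates):
    """Stable: filter the header in file order first, then index by position."""
    wanted = set(usecols)
    kept = [name for name in header if name in wanted]
    return [kept[i] for i in parse_dates]


header = list("abcdefghij")
usecols = list("abcdefghij")

# Position 0 always resolves to the first kept column of the file.
print(resolve_via_header(header, usecols, [0]))

# Position 0 resolves to whichever column the set happens to yield first,
# which can change from one interpreter run to the next.
print(resolve_via_set(usecols, [0]))
```

In the report, whichever column the set-based lookup landed on was treated as a date column and skipped during numeric conversion, which would explain why a different column showed up as object on each run.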

gfyoung added a commit to forking-repos/pandas that referenced this issue Dec 25, 2016
@jreback jreback added this to the 0.20.0 milestone Dec 26, 2016
gfyoung added a commit to forking-repos/pandas that referenced this issue Dec 27, 2016
gfyoung added a commit to forking-repos/pandas that referenced this issue Dec 27, 2016