Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_csv with names, usecols and parse_dates #9755

Closed
brechtm opened this issue Mar 31, 2015 · 2 comments
Closed

read_csv with names, usecols and parse_dates #9755

brechtm opened this issue Mar 31, 2015 · 2 comments
Labels
Bug IO CSV read_csv, to_csv
Milestone

Comments

@brechtm
Copy link

brechtm commented Mar 31, 2015

xref #12203

The arrays passed to the date_parser function is different when names and use_cols are specified to limit the number of parsed columns.

When running the example code below, the date_parser function receives two arguments, one array with '20140101' strings, and one array with integers. The default date_parser fails to process this input.

When assigning an empty list to DROPPED_COLUMNS (so that all columns are parsed), the second array contains strings instead of integers, and the datetimes are parsed correctly.

The problem doesn't occur with engine='python'. I haven't tested the influence of the header and index_cols options.

Python script:

from __future__ import print_function, division
import pandas as pd

CSV = '2014.csv'

DROPPED_COLUMNS = ['NCDC', 'I', 'QCP']
# DROPPED_COLUMNS = []

with open(CSV) as csv:
    csv.readline()
    column_names = csv.readline().split()
    used_columns = [i for i, column_name in enumerate(column_names)
                    if column_name not in DROPPED_COLUMNS]
    used_col_names = [column_name for i, column_name in enumerate(column_names)
                      if i in used_columns]
    parse_dates = [[i for i, column_name in enumerate(used_col_names)
                    if column_name in ('Date', 'HrMn')]]
    print(parse_dates)

    data = pd.read_csv(csv, header=None, names=used_col_names, index_col=False, engine='python',
                       parse_dates=parse_dates, usecols=used_columns)
    print(data)

Contents of 2014.csv:

Identification                          TEMP
USAF   NCDC  Date     HrMn I Type  QCP  Temp   Q
062693,99999,20140101,0025,4,FM-15,    ,   7.0,1,
062693,99999,20140101,0055,4,FM-15,    ,   6.0,1,
062693,99999,20140101,0125,4,FM-15,    ,   6.0,1,
062693,99999,20140101,0155,4,FM-15,    ,   6.0,1,
062693,99999,20140101,0225,4,FM-15,    ,   6.0,1,
062693,99999,20140101,0255,4,FM-15,    ,   6.0,1,
062693,99999,20140101,0325,4,FM-15,    ,   6.0,1,
062693,99999,20140101,0355,4,FM-15,    ,   6.0,1,
@jorisvandenbossche
Copy link
Member

I can confirm this with master.

With as simplified example:

s = """0,1,20140101,0900,4
0,1,20140102,1000,4"""

# using all cols -> no problem
pd.read_csv(StringIO(s), header=None, names=list('abcde'), parse_dates=[[2,3]], engine='c')

# using certain cols -> problem
pd.read_csv(StringIO(s), header=None, names=list('acd'), usecols=[0,2,3], parse_dates=[[1,2]], engine='c')

# but not with python engine
pd.read_csv(StringIO(s), header=None, names=list('acd'), usecols=[0,2,3], parse_dates=[[1,2]], engine='python')

But it seems the header/names do not matter for reproducing it. So this also reproduces it:

s = """a,b,c,d,e
0,1,20140101,0900,4
0,1,20140102,1000,4"""

# using all cols -> no problem
pd.read_csv(StringIO(s), parse_dates=[[2,3]], engine='c')

# using certain cols -> problem
pd.read_csv(StringIO(s), usecols=[0,2,3], parse_dates=[[1,2]], engine='c')

# but not with python engine
pd.read_csv(StringIO(s), usecols=[0,2,3], parse_dates=[[1,2]], engine='python')

gives:

# using all cols -> no problem
In [16]: pd.read_csv(StringIO(s), parse_dates=[[2,3]], engine='c')
Out[16]:
                  c_d  a  b  e
0 2014-01-01 09:00:00  0  1  4
1 2014-01-02 10:00:00  0  1  4

# using certain cols -> problem
In [18]: pd.read_csv(StringIO(s), usecols=[0,2,3], parse_dates=[[1,2]], engine='c')
Out[18]:
             c_d  a
0   20140101 900  0
1  20140102 1000  0

# but not with python engine
In [20]: pd.read_csv(StringIO(s), usecols=[0,2,3], parse_dates=[[1,2]], engine='python')
Out[20]:
                  c_d  a
0 2014-01-01 09:00:00  0
1 2014-01-02 10:00:00  0

@jorisvandenbossche jorisvandenbossche added IO CSV read_csv, to_csv Bug labels Apr 1, 2015
@jorisvandenbossche
Copy link
Member

@brechtm Thanks for reporting! (interested in looking into it?)

cc @mdmueller @selasley

gfyoung added a commit to forking-repos/pandas that referenced this issue Apr 6, 2016
Fixes bug in processing 'parse_dates' with the C engine
in which the wrong indices (those of the filtered column
names) were being used to determine the date columns to
not be dtype-parsed by the C engine. The correct indices
are those of the original (unfiltered) column names, as
they are used later on in the actual data processing.

Closes pandas-devgh-9755.
@jreback jreback closed this as completed in c6c201e Apr 6, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants