read_csv with names, usecols and parse_dates #9755

brechtm · 2015-03-31T08:01:57Z

The arrays passed to the date_parser function is different when names and use_cols are specified to limit the number of parsed columns.

When running the example code below, the date_parser function receives two arguments, one array with '20140101' strings, and one array with integers. The default date_parser fails to process this input.

When assigning an empty list to DROPPED_COLUMNS (so that all columns are parsed), the second array contains strings instead of integers, and the datetimes are parsed correctly.

The problem doesn't occur with engine='python'. I haven't tested the influence of the header and index_cols options.

Python script:

from __future__ import print_function, division
import pandas as pd

CSV = '2014.csv'

DROPPED_COLUMNS = ['NCDC', 'I', 'QCP']
# DROPPED_COLUMNS = []

with open(CSV) as csv:
    csv.readline()
    column_names = csv.readline().split()
    used_columns = [i for i, column_name in enumerate(column_names)
                    if column_name not in DROPPED_COLUMNS]
    used_col_names = [column_name for i, column_name in enumerate(column_names)
                      if i in used_columns]
    parse_dates = [[i for i, column_name in enumerate(used_col_names)
                    if column_name in ('Date', 'HrMn')]]
    print(parse_dates)

    data = pd.read_csv(csv, header=None, names=used_col_names, index_col=False, engine='python',
                       parse_dates=parse_dates, usecols=used_columns)
    print(data)

Contents of 2014.csv:

Identification                          TEMP
USAF   NCDC  Date     HrMn I Type  QCP  Temp   Q
062693,99999,20140101,0025,4,FM-15,    ,   7.0,1,
062693,99999,20140101,0055,4,FM-15,    ,   6.0,1,
062693,99999,20140101,0125,4,FM-15,    ,   6.0,1,
062693,99999,20140101,0155,4,FM-15,    ,   6.0,1,
062693,99999,20140101,0225,4,FM-15,    ,   6.0,1,
062693,99999,20140101,0255,4,FM-15,    ,   6.0,1,
062693,99999,20140101,0325,4,FM-15,    ,   6.0,1,
062693,99999,20140101,0355,4,FM-15,    ,   6.0,1,

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2015-04-01T08:31:22Z

I can confirm this with master.

With as simplified example:

s = """0,1,20140101,0900,4
0,1,20140102,1000,4"""

# using all cols -> no problem
pd.read_csv(StringIO(s), header=None, names=list('abcde'), parse_dates=[[2,3]], engine='c')

# using certain cols -> problem
pd.read_csv(StringIO(s), header=None, names=list('acd'), usecols=[0,2,3], parse_dates=[[1,2]], engine='c')

# but not with python engine
pd.read_csv(StringIO(s), header=None, names=list('acd'), usecols=[0,2,3], parse_dates=[[1,2]], engine='python')

But it seems the header/names do not matter for reproducing it. So this also reproduces it:

s = """a,b,c,d,e
0,1,20140101,0900,4
0,1,20140102,1000,4"""

# using all cols -> no problem
pd.read_csv(StringIO(s), parse_dates=[[2,3]], engine='c')

# using certain cols -> problem
pd.read_csv(StringIO(s), usecols=[0,2,3], parse_dates=[[1,2]], engine='c')

# but not with python engine
pd.read_csv(StringIO(s), usecols=[0,2,3], parse_dates=[[1,2]], engine='python')

gives:

# using all cols -> no problem
In [16]: pd.read_csv(StringIO(s), parse_dates=[[2,3]], engine='c')
Out[16]:
                  c_d  a  b  e
0 2014-01-01 09:00:00  0  1  4
1 2014-01-02 10:00:00  0  1  4

# using certain cols -> problem
In [18]: pd.read_csv(StringIO(s), usecols=[0,2,3], parse_dates=[[1,2]], engine='c')
Out[18]:
             c_d  a
0   20140101 900  0
1  20140102 1000  0

# but not with python engine
In [20]: pd.read_csv(StringIO(s), usecols=[0,2,3], parse_dates=[[1,2]], engine='python')
Out[20]:
                  c_d  a
0 2014-01-01 09:00:00  0
1 2014-01-02 10:00:00  0

jorisvandenbossche · 2015-04-01T08:37:21Z

@brechtm Thanks for reporting! (interested in looking into it?)

cc @mdmueller @selasley

Fixes bug in processing 'parse_dates' with the C engine in which the wrong indices (those of the filtered column names) were being used to determine the date columns to not be dtype-parsed by the C engine. The correct indices are those of the original (unfiltered) column names, as they are used later on in the actual data processing. Closes pandas-devgh-9755.

jorisvandenbossche added IO CSV read_csv, to_csv Bug labels Apr 1, 2015

jreback mentioned this issue Feb 2, 2016

io.read_csv: ValueError due to long lines #12203

Closed

jreback added Difficulty Intermediate labels Feb 3, 2016

jreback added this to the Next Major Release milestone Feb 3, 2016

VelizarVESSELINOV mentioned this issue Feb 29, 2016

usecols is not respected when the file is empty #12493

Closed

gfyoung mentioned this issue Mar 2, 2016

BUG: Fix parse_dates processing with usecols and C engine #12512

Closed

jreback modified the milestones: 0.18.1, Next Major Release Mar 8, 2016

jreback closed this as completed in c6c201e Apr 6, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

read_csv with names, usecols and parse_dates #9755

read_csv with names, usecols and parse_dates #9755

brechtm commented Mar 31, 2015

jorisvandenbossche commented Apr 1, 2015

jorisvandenbossche commented Apr 1, 2015

read_csv with names, usecols and parse_dates #9755

read_csv with names, usecols and parse_dates #9755

Comments

brechtm commented Mar 31, 2015

jorisvandenbossche commented Apr 1, 2015

jorisvandenbossche commented Apr 1, 2015