Feature Request: skipcols in .read_csv #10882

Closed
pylang opened this Issue Aug 22, 2015 · 18 comments


pylang commented Aug 22, 2015

I'd like to read a set of CSV files but exclude specific columns. read_csv currently has a usecols keyword, but it requires listing every column to keep. This is a bit tedious and, more importantly, not all files have the same columns, so usecols would not work in the general case, whereas a complementary keyword would. Can a skipcols keyword be added in 0.17 that accepts a list of column names and reads all but those columns into a DataFrame? Thanks.

xref #4749
xref #8985
xref #6710

jreback added this to the Next Major Release milestone Aug 22, 2015

jreback added Prio-low and removed Prio-medium labels Aug 22, 2015

Contributor

terrytangyuan commented Sep 1, 2015

So read_csv() is defined by calling _make_parser_function() which calls _read(). Any instructions would be appreciated. It's a bit confusing to me. @jreback

Contributor

jreback commented Sep 1, 2015

parser is a bit complicated. see how usecols is used.

Contributor

terrytangyuan commented Sep 2, 2015

It looks like the code related to usecols needs to be refactored before I add things on top of it. I won't be able to refactor it because I might break a lot of internal things. @jreback

Contributor

TomAugspurger commented Sep 2, 2015

Is there a spot in the code where we know all the columns before starting to parse the rows? If so, you can assign usecols=set(all_cols) - set(skipcols) (would need to fix up the ordering afterwards) and go from there.
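
A minimal user-level sketch of the same idea, not the parser change discussed here: read only the header first, subtract the unwanted names, then pass the rest as usecols. The helper name read_csv_skipcols and its skipcols argument are hypothetical and are not part of pandas. It assumes the source is a path or a seekable buffer.

```python
from io import StringIO

import pandas as pd


def read_csv_skipcols(source, skipcols):
    """Hypothetical helper (not a pandas API): read every column except `skipcols`."""
    # First pass: nrows=0 parses only the header line, so this is cheap.
    all_cols = pd.read_csv(source, nrows=0).columns
    # Rewind buffers such as StringIO so the second pass starts at the top.
    if hasattr(source, "seek"):
        source.seek(0)
    skip = set(skipcols)
    # A list comprehension keeps the file's column order, unlike a set difference.
    usecols = [col for col in all_cols if col not in skip]
    return pd.read_csv(source, usecols=usecols)


data = "a,b,c\n1,2,3\n4,5,6\n"
df = read_csv_skipcols(StringIO(data), ["b", "not_in_this_file"])
print(list(df.columns))
```

Names in skipcols that are missing from a given file are simply ignored, which covers the case of files with differing columns. The cost is that the header is read twice.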

Contributor

terrytangyuan commented Sep 2, 2015

Yeah, I did something similar but stopped due to other code related to usecols. I'll look into it again. Thanks.

pylang commented Dec 20, 2015

Any progress on this addition? Thanks.

pylang commented Jul 10, 2016

ping @jreback

Contributor

jreback commented Jul 10, 2016

if you submit a PR there will be progress
we have 1700 open issues

pylang commented Jul 10, 2016

many thanks

jreback modified the milestone: Next Major Release, Someday Sep 5, 2016

pylang commented Sep 25, 2016 edited

In a similar vein, is there a way to read in a subset of rows? In other words, is there a counterpart to the skiprows keyword? For example, this feature is desired:

df = pd.read_csv("bigdata.csv")
df
# Output: Millions of rows

selection = [i for i in range(0, 1000000) if i % 2 == 0]
subset = pd.read_csv("bigdata.csv", use_rows=selection)    # skip all rows except those listed
subset
# Output: only even rows for the first million

Member

gfyoung commented Jan 3, 2017

@pylang : We now accept a callable for usecols. Does that help to resolve this issue?

Contributor

jreback commented Jan 3, 2017

sure, maybe an example of doing that in io.rst would be helpful?

jreback added the Docs label Jan 3, 2017

pylang commented Jan 4, 2017

@gfyoung I'm not sure what you have in mind. I am interested in selecting rows. An example would be helpful, thank you.

Member

gfyoung commented Jan 4, 2017

@pylang :

  1. Your original issue was for skipcols though?
  2. skiprows is currently not supported by the C engine. However, we could by all means allow skiprows to be a callable, as usecols is? How does that sound? Something like:
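>>> from io import StringIO
>>> from pandas import read_csv
>>> data = 'a,b,c\n1,2,3\n2,3,4'
>>> # x counts lines from 0: line 0 is the header, line 1 the first data row
>>> read_csv(StringIO(data), skiprows=lambda x: x % 2 == 1, engine='python')
   a  b  c
0  2  3  4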

where x is the row number (starting at 0)
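
For reference, a sketch of how a callable skiprows could cover the use_rows request from earlier in the thread. It assumes the behaviour of released pandas, where the callable receives the 0-based line index and line 0 is the header, so the predicate has to let index 0 through. The names read_selected_rows and keep_rows are hypothetical and are not pandas APIs.

```python
from io import StringIO

import pandas as pd


def read_selected_rows(source, keep_rows):
    """Hypothetical helper (not a pandas API): keep only the listed data rows.

    `keep_rows` holds 0-based positions of data rows, not counting the header.
    """
    keep = set(keep_rows)  # set membership keeps the per-line check cheap

    def should_skip(line_number):
        # Line 0 is the header and must survive; data row i sits on line i + 1.
        return line_number != 0 and (line_number - 1) not in keep

    return pd.read_csv(source, skiprows=should_skip)


data = "a,b\n0,x\n1,y\n2,z\n3,w\n"
subset = read_selected_rows(StringIO(data), [0, 2])
print(subset)
```

Passing range(0, 1000000, 2) as keep_rows would correspond to the "even rows of the first million" selection described above.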

Member

gfyoung commented Jan 4, 2017

@jreback : There are examples in the docs to illustrate usecols, but we can also mention that we can use the callable to exclude columns as well. How does that sound?
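
A short sketch of that pattern: the callable passed to usecols is called with each column name and returns True for the columns to keep, so negating a membership test skips columns. The sample data and the variable name skipcols are made up for illustration.

```python
from io import StringIO

import pandas as pd

data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n"

# Columns to leave out; `skipcols` is just a local name here, not a read_csv keyword.
skipcols = {"b", "d"}

# usecols calls the function once per column name and keeps the names that return True.
df = pd.read_csv(StringIO(data), usecols=lambda name: name not in skipcols)
print(list(df.columns))
```

Because the test runs against whatever header each file has, the same call works across files whose columns differ.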

Contributor

jreback commented Jan 4, 2017

yes that's what i mean, to show using a callable to skipcols

pylang commented Jan 4, 2017

@gfyoung I think your example for skiprows would suffice. And yes you are correct re: skipcols. A similar callable option to filter usecols with an example in the docs would be sufficient imo.

gfyoung added a commit to gfyoung/pandas that referenced this issue Jan 4, 2017

DOC: Add example of skipcols in read_csv
Illustrate how we can use the "usecols"
argument to skip particular columns.

Closes gh-10882.
bea6137

Member

gfyoung commented Jan 4, 2017

@pylang : #15059 is up to address skiprows. I've hit a roadblock at this point implementing it for the C engine, so any input on that would be appreciated!

jorisvandenbossche added a commit that referenced this issue Jan 4, 2017

gfyoung + jorisvandenbossche DOC: Add example of skipcols in read_csv (#15052)
Illustrate how we can use the "usecols"
argument to skip particular columns.

Closes gh-10882.
4de5cdc

jreback added a commit that referenced this issue Jan 14, 2017

gfyoung + jreback ENH: Accept callable for skiprows in read_csv
Title is self-explanatory.

xref #10882.

Author: gfyoung <gfyoung17@gmail.com>

Closes #15059 from gfyoung/skiprows-callable and squashes the following commits:

d15e3a3 [gfyoung] ENH: Accept callable for skiprows
7ad6c65
