Feature Request: skipcols in .read_csv #10882

Closed
pylang opened this issue Aug 22, 2015 · 18 comments

@pylang

pylang commented Aug 22, 2015

I'd like to read a set of CSV files but exclude specific columns. read_csv currently has a usecols keyword, but it requires listing every column to keep. This is tedious and, more importantly, not all of the files have the same columns, so a fixed usecols list would not work in the general case, whereas a complementary keyword would. Can a skipcols keyword be added to 0.17 that accepts a list of column names and reads all but those columns into a DataFrame? Thanks.

xref #4749
xref #8985
xref #6710

@jreback jreback added this to the Next Major Release milestone Aug 22, 2015
@terrytangyuan
Contributor

So read_csv() is defined by calling _make_parser_function() which calls _read(). Any instructions would be appreciated. It's a bit confusing to me. @jreback

@jreback
Contributor

jreback commented Sep 1, 2015

parser is a bit complicated. see how usecols is used.

@terrytangyuan
Contributor

It looks like the code related to usecols needs to be re-factored before I add things on top of it. I won't be able to re-factor coz I might break a lot of internal things. @jreback

@TomAugspurger
Contributor

Is there a spot in the code where we know all the columns before starting to parse the rows? If so, you can assign usecols=set(all_cols) - set(skipcols) (you would need to fix up the ordering afterwards) and go from there.
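
For illustration only, the same idea can be sketched from user code rather than inside the parser. This is a minimal sketch, not the pandas implementation: `read_csv_skipcols` is a hypothetical helper name, and it assumes a seekable buffer whose header row can be read twice. Building the list in file order sidesteps the ordering concern that a plain set difference raises.

```python
from io import StringIO

import pandas as pd


def read_csv_skipcols(buffer, skipcols):
    """Hypothetical helper: read a CSV buffer, excluding the named columns."""
    # First pass: parse only the header row to learn the column names.
    all_cols = pd.read_csv(buffer, nrows=0).columns
    buffer.seek(0)  # rewind so the second pass starts at the header again

    # Keep columns in file order; names in skipcols that are absent from
    # this particular file are simply ignored.
    skip = set(skipcols)
    keep = [col for col in all_cols if col not in skip]
    return pd.read_csv(buffer, usecols=keep)


data = "a,b,c\n1,2,3\n4,5,6\n"
df = read_csv_skipcols(StringIO(data), ["b", "not_in_this_file"])
print(df.columns.tolist())
```

Because absent names are ignored, the same skipcols list can be reused across files that do not all share the same columns.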

@terrytangyuan
Contributor

Yeah I did something similar but stopped due to other code related to usecols. I'll look into it again. Thanks.

@pylang
Author

pylang commented Dec 20, 2015

Any progress on this addition? Thanks.

@pylang
Author

pylang commented Jul 10, 2016

ping @jreback

@jreback
Contributor

jreback commented Jul 10, 2016

if you submit a PR there will be progress
we have 1700 open issues

@pylang
Author

pylang commented Jul 10, 2016

many thanks

@jorisvandenbossche jorisvandenbossche modified the milestones: Someday, Next Major Release Jul 11, 2016
@jreback jreback modified the milestones: Next Major Release, Someday Sep 5, 2016
@pylang
Author

pylang commented Sep 25, 2016

In a similar vein, is there a way to read in only a subset of rows? In other words, is there a counterpart to the skiprows keyword? For example, something like this would be useful:

df = pd.read_csv("bigdata.csv")
df
# Output: Millions of rows

selection = [i for i in range(0, 1000000) if i % 2 == 0]
subset = pd.read_csv("bigdata.csv", use_rows=selection)    # skip all rows except those listed
subset
# Output: only even rows for the first million

@gfyoung
Member

gfyoung commented Jan 3, 2017

@pylang : We now accept callable for usecols. Does that help to resolve this issue?
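
As a rough illustration of how a callable usecols can stand in for skipcols: the callable receives each column name and returns True for the columns to keep. The data and the `skipcols` set below are made up for the example, and it assumes a pandas version in which usecols accepts a callable.

```python
from io import StringIO

import pandas as pd

data = "a,b,c\n1,2,3\n4,5,6\n"

# Columns to exclude; a name that is missing from the file is harmless.
skipcols = {"b", "not_in_this_file"}

# Keep every column whose name is not in skipcols.
df = pd.read_csv(StringIO(data), usecols=lambda name: name not in skipcols)
print(df.columns.tolist())
```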

@jreback
Contributor

jreback commented Jan 3, 2017

sure, maybe an example of doing that in io.rst would be helpful?

@jreback jreback added the Docs label Jan 3, 2017
@pylang
Author

pylang commented Jan 4, 2017

@gfyoung I'm not sure what you have in mind. I am interested in selecting rows. An example would be helpful, thank you.

@gfyoung
Member

gfyoung commented Jan 4, 2017

@pylang :

  1. Your original issue was for skipcols though?
  2. skiprows is currently not supported by the C engine. However, we could by all means allow skiprows to be a callable, as usecols is. How does that sound? Something like:
>>> from io import StringIO
>>> from pandas import read_csv
>>> data = 'a,b,c\n1,2,3\n2,3,4'
>>> read_csv(StringIO(data), skiprows=lambda x: x % 2 == 1, engine='python')
   a  b  c
0  2  3  4

where x is the row number (starting at 0)
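
The use_rows request from earlier in the thread could be approximated with the same callable form by inverting the test. This is only a sketch: it assumes a pandas version in which skiprows accepts a callable (the change tracked in #15059), and the `keep` set and sample data are invented for the example.

```python
from io import StringIO

import pandas as pd

data = "x\n0\n1\n2\n3\n4\n5\n"

# Zero-based indices of the data rows to keep (illustrative values).
keep = {0, 2, 4}

# The callable sees file line numbers: line 0 is the header, so data row i
# sits on file line i + 1. Return True for every line that should be skipped.
subset = pd.read_csv(
    StringIO(data),
    skiprows=lambda line: line > 0 and (line - 1) not in keep,
)
print(subset["x"].tolist())
```

Using a set for `keep` makes each membership check constant time, which matters when the callable runs once per line of a large file.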

@gfyoung
Member

gfyoung commented Jan 4, 2017

@jreback : There are examples in the docs to illustrate usecols, but we can also mention that we can use the callable to exclude columns as well. How does that sound?

@jreback
Contributor

jreback commented Jan 4, 2017

yes that's what i mean, to show using a callable to skip columns

@pylang
Author

pylang commented Jan 4, 2017

@gfyoung I think your example for skiprows would suffice. And yes you are correct re: skipcols. A similar callable option to filter usecols with an example in the docs would be sufficient imo.

gfyoung added a commit to forking-repos/pandas that referenced this issue Jan 4, 2017
Illustrate how we can use the "usecols"
argument to skip particular columns.

Closes gh-10882.
@gfyoung
Member

gfyoung commented Jan 4, 2017

@pylang : #15059 is up to address skiprows. I've hit a roadblock at this point implementing it for the C engine, so any input on that would be appreciated!

jorisvandenbossche pushed a commit that referenced this issue Jan 4, 2017
Illustrate how we can use the "usecols"
argument to skip particular columns.

Closes gh-10882.
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.20.0, Next Major Release Jan 4, 2017
jreback pushed a commit that referenced this issue Jan 14, 2017
Title is self-explanatory.

xref #10882.

Author: gfyoung <gfyoung17@gmail.com>

Closes #15059 from gfyoung/skiprows-callable and squashes the following commits:

d15e3a3 [gfyoung] ENH: Accept callable for skiprows