Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for read_fwf() as in R. #952

Merged
merged 4 commits into from
Apr 2, 2012
Merged

Conversation

blais
Copy link
Contributor

@blais blais commented Mar 22, 2012

I added a simple function like R's read_fwf().

It also made sense to refactor read_csv() and read_table() to redirect to a new function _read() which takes the parser's class as input. I needed this for the FixedWidthFieldParser which needs to override the TextParser in order to provide a custom reader for the fields. Please review; the tests pass.

Also, I made the column spec ("widths")

  • zero-based, and
  • inclusive on both the from and to fields (e.g., a colspec of (1,2) has two characters)
    I'm not sure that's the best; I think in practice a lot of column format specs start counting at one, but I haven't done a survey or anything, feel free to change it. Maybe we should support both.

self.f = f
self.colspecs = colspecs

assert isinstance(colspecs, (tuple, list))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be less strict with this first check - there might be situations where you want to pass a generator rather than a list. I'd do self.colspecs = list(colspecs), and let it handle anything that can be turned into a list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. WIll do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@takluyver
Copy link
Contributor

A broader point about the behaviour: how should it handle non-ascii data, especially with variable width encodings like UTF-8?

At present, I think this will do different things according to the version of Python:

  • In Python 3, it will decode the file using the specified encoding you pass to the file, and count by character across the line (i.e. is one character). If you don't specify the encoding, a platform default encoding is used, and unexpected bytes are replaced with the replacement character.
  • In Python 2, it will completely ignore the encoding parameter and count bytes, so is three bytes (if the file is UTF-8).

I guess counting bytes is more consistent with what other tools mean by fixed width, but we should decode strings in line with the encoding parameter.

@blais
Copy link
Contributor Author

blais commented Mar 24, 2012

I've made changes recommend by Thomas.
I've added more tests.
I've done some careful refactorings of the read_* functions for clarity.
Tests pass.
Please pull.

verbose=False,
delimiter=None,
encoding=None):
kwds = locals()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not wild on the use of locals() for these, it seems like unnecessary magic. But maybe I'm being overly picky.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the alternative here is to either write out kw=kw for each keyword argument or to have **kwds which makes the signature in IPython less attractive. Not sure what's the best solution-- using locals doesn't strike me as so bad

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having to enumerate all the paramters is both error-prone, makes it difficult to extend the other functions, and it hides the differences between the calls to _read(). I wish there was a method to get just the args, but there isn't.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I think PEP 362 is aimed at this sort of thing - you'd use **kwargs, and construct a more meaningful function signature for introspection - but that's still a work in progress.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally it would be like LISP and a :as variable could be assigned to the set of kwargs. But it's Python. Whatever. We'll eventually end up with LISP again.

@takluyver
Copy link
Contributor

OK, I've made a couple more minor comments. I'll like someone with a fresh pair of eyes look over this - pinging @wesm and @adamklein.

@wesm
Copy link
Member

wesm commented Mar 24, 2012

This all looks pretty good to me. I'll take it for a spin in the next couple of days and merge it in. Thanks for the PR!

@wesm wesm merged commit b2e08b5 into pandas-dev:master Apr 2, 2012
@wesm
Copy link
Member

wesm commented Apr 2, 2012

Sweet thanks all for this, very nice work, merged it in.

@blais you'll want to fetch pydata/master now and git reset --hard upstream/master (where upstream is the remote for pydata/pandas) to fix up your master

@wesm wesm mentioned this pull request Apr 2, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants