Added support for read_fwf() as in R. #952

blais · 2012-03-22T15:10:38Z

I added a simple function like R's read_fwf().

It also made sense to refactor read_csv() and read_table() to redirect to a new function _read() which takes the parser's class as input. I needed this for the FixedWidthFieldParser which needs to override the TextParser in order to provide a custom reader for the fields. Please review; the tests pass.

Also, I made the column spec ("widths"):

zero-based, and
inclusive on both the from and to fields (e.g., a colspec of (1,2) covers two characters).

I'm not sure that's the best choice. I think in practice a lot of column format specs start counting at one, but I haven't done a survey or anything, so feel free to change it. Maybe we should support both.
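
To make the proposed convention concrete, here is a minimal sketch of slicing one line with zero-based colspecs that are inclusive on both ends. The helper name `split_fixed_width` is hypothetical and is not part of this PR; it only illustrates the indexing rule described above.

```python
def split_fixed_width(line, colspecs):
    """Slice `line` with zero-based colspecs that are inclusive on both ends.

    Hypothetical helper for illustration; not the parser added in this PR.
    """
    # Python slices exclude the end index, so add 1 to the inclusive "to" field.
    return [line[start:stop + 1] for start, stop in colspecs]


fields = split_fixed_width("ABCDEFGH", [(0, 0), (1, 2), (3, 7)])
print(fields)  # ['A', 'BC', 'DEFGH']
```

Under this rule a colspec of (1, 2) yields two characters, which is the case mentioned in the description.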

takluyver · 2012-03-22T15:20:30Z

pandas/io/parsers.py

+        self.f = f
+        self.colspecs = colspecs
+
+        assert isinstance(colspecs, (tuple, list))


I'd be less strict with this first check - there might be situations where you want to pass a generator rather than a list. I'd do self.colspecs = list(colspecs), and let it handle anything that can be turned into a list.

Good idea. Will do.
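
A small sketch of the suggestion, assuming a stand-in class (`FixedWidthReaderSketch` and `cumulative_colspecs` are hypothetical names, not the code in this PR): calling `list()` on the argument accepts tuples, lists, and generators alike, whereas the `isinstance` assertion would reject a generator.

```python
class FixedWidthReaderSketch:
    """Hypothetical stand-in for the reader under review."""

    def __init__(self, f, colspecs):
        self.f = f
        # list() materializes any iterable, so generators work too.
        self.colspecs = list(colspecs)


def cumulative_colspecs(widths):
    """Yield zero-based, inclusive (from, to) pairs from a list of widths."""
    start = 0
    for width in widths:
        yield (start, start + width - 1)
        start += width


reader = FixedWidthReaderSketch(None, cumulative_colspecs([3, 2, 4]))
print(reader.colspecs)  # [(0, 2), (3, 4), (5, 8)]
```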

takluyver · 2012-03-22T17:25:16Z

A broader point about the behaviour: how should it handle non-ascii data, especially with variable width encodings like UTF-8?

At present, I think this will do different things according to the version of Python:

In Python 3, it will decode the file using the specified encoding you pass to the file, and count by character across the line (i.e. 字 is one character). If you don't specify the encoding, a platform default encoding is used, and unexpected bytes are replaced with the replacement character.
In Python 2, it will completely ignore the encoding parameter and count bytes, so 字 is three bytes (if the file is UTF-8).

I guess counting bytes is more consistent with what other tools mean by fixed width, but we should decode strings in line with the encoding parameter.
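
The difference between counting characters and counting bytes can be seen in a few lines of Python 3. This is only an illustration of the concern, not the parser's behaviour; the sample string is made up.

```python
text = "字ab"
raw = text.encode("utf-8")

print(len(text))  # 3 characters
print(len(raw))   # 5 bytes: 字 takes three bytes in UTF-8

# Slicing the first two "columns" by character keeps 字 intact.
by_char = text[0:2]

# Slicing the first two columns by byte cuts 字 in the middle, so the
# fragment cannot be decoded cleanly on its own.
by_byte = raw[0:2]
decoded = by_byte.decode("utf-8", errors="replace")
print(repr(by_char), repr(decoded))
```

A byte-based field boundary that falls inside a multi-byte character therefore needs some policy (error, replacement, or widening the field) before the field can be decoded.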


blais · 2012-03-24T17:28:58Z

I've made the changes recommended by Thomas.
I've added more tests.
I've done some careful refactorings of the read_* functions for clarity.
Tests pass.
Please pull.

takluyver · 2012-03-24T21:19:37Z

pandas/io/parsers.py

+             verbose=False,
+             delimiter=None,
+             encoding=None):
+    kwds = locals()


I'm not wild on the use of locals() for these, it seems like unnecessary magic. But maybe I'm being overly picky.

Well, the alternative here is either to write out kw=kw for each keyword argument or to have **kwds, which makes the signature in IPython less attractive. Not sure what the best solution is; using locals() doesn't strike me as so bad.

Having to enumerate all the parameters is error-prone, makes it difficult to extend the other functions, and hides the differences between the calls to _read(). I wish there was a method to get just the args, but there isn't.

Indeed. I think PEP 362 is aimed at this sort of thing - you'd use **kwargs, and construct a more meaningful function signature for introspection - but that's still a work in progress.

Ideally it would be like LISP and a :as variable could be assigned to the set of kwargs. But it's Python. Whatever. We'll eventually end up with LISP again.
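
For readers following the discussion, here is a rough sketch of the pattern being debated: each public reader captures its own arguments and forwards them to a shared `_read()` together with a parser class. Every name ending in `_sketch` or `Sketch` is hypothetical, the signatures are shortened, and the `dict(...)` copy is a defensive choice of this sketch rather than something taken from the PR.

```python
class TextParserSketch:
    """Hypothetical stand-in for the delimited-text parser."""

    def __init__(self, **kwds):
        self.kwds = kwds


class FixedWidthParserSketch(TextParserSketch):
    """Hypothetical stand-in for the fixed-width parser subclass."""


def _read(parser_cls, kwds):
    # Shared entry point: the parser class is the only thing that varies.
    return parser_cls(**kwds)


def read_csv_sketch(filepath, sep=",", verbose=False, encoding=None):
    # Called first, so the mapping holds exactly the parameters.
    kwds = dict(locals())
    return _read(TextParserSketch, kwds)


def read_fwf_sketch(filepath, colspecs=None, verbose=False, encoding=None):
    kwds = dict(locals())
    kwds["colspecs"] = list(colspecs or [])
    return _read(FixedWidthParserSketch, kwds)


parser = read_fwf_sketch("data.txt", colspecs=((0, 3), (4, 9)))
print(type(parser).__name__, sorted(parser.kwds))
```

The explicit signatures stay visible to introspection tools, which is the advantage claimed over `**kwds`; the cost is that the capture only works if it is the first statement in the function.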

takluyver · 2012-03-24T21:23:56Z

OK, I've made a couple more minor comments. I'd like someone with a fresh pair of eyes to look over this; pinging @wesm and @adamklein.

wesm · 2012-03-24T21:39:54Z

This all looks pretty good to me. I'll take it for a spin in the next couple of days and merge it in. Thanks for the PR!

wesm · 2012-04-02T16:29:16Z

Sweet, thanks all for this. Very nice work; merged it in.

@blais you'll want to fetch pydata/master now and git reset --hard upstream/master (where upstream is the remote for pydata/pandas) to fix up your master

Added basic support for fixed-width fields, as per R's read_fwf.

47e6b5c

takluyver reviewed Mar 22, 2012

blais added 2 commits March 24, 2012 10:56

Merge https://github.com/pydata/pandas

42309d1

Fixed various issues in read_fwf() implementation (applied merge comments from Thomas Kluyver).

d044688

takluyver reviewed Mar 24, 2012

Removed old comment.

b2e08b5

wesm mentioned this pull request Mar 30, 2012

Implement fixed-width file reader #920

Closed

wesm merged commit b2e08b5 into pandas-dev:master Apr 2, 2012

wesm mentioned this pull request Apr 2, 2012

Allow read_csv to take URLs #970

Merged
