should start/next/done iterate over rows or cols? #48

HarlanH · 2012-08-08T12:40:33Z

Currently, start/next/done iterates over columns of an AbstractDataFrame. It seems to me that they should instead iterate over rows, as do these functions for DataStreams. The next() return value should presumably be a 1-row SubDataFrame.

Are there any current functions that depend on the current behavior?

@tshort, GitHub says I should blame you for this. :)

johnmyleswhite · 2012-08-08T12:48:56Z

Having iteration for AbstractDataFrames work over rows does seem like the right approach, but there's definitely a use for easy-to-use tools to perform column iteration. So much behavior on DataFrames follows a design pattern where you perform operations column-wise and then combine the results. The mean of a DataFrame, for example, is a mapping where you aggregate the means of each column. I imagine all of that can be achieved with map, but maybe there's something customized to be added here eventually.
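The column-wise pattern described above (apply a function to each column, then combine the results) can be sketched roughly as follows. This is only an illustration in current Julia, not the package's API: the `columns` Dict is a hypothetical stand-in for a DataFrame, and `mean` comes from the Statistics standard library.

```julia
using Statistics  # standard library; provides mean

# Hypothetical toy table: column names mapped to numeric vectors.
# This is a stand-in for a DataFrame, not the real type.
columns = Dict(:height => [1.0, 2.0, 3.0], :weight => [10.0, 20.0, 60.0])

# Apply mean to each column, then combine the per-column results into one Dict.
colmeans = Dict(name => mean(col) for (name, col) in columns)
```

Here `colmeans` maps each column name to the mean of that column, which is the "aggregate the means of each column" step in miniature.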

HarlanH · 2012-08-08T12:53:26Z

Yes, I agree. colwise is a good start, of course. Maybe we need another wrapper type that implicitly flips the axes, so you can do [f(col) for col in flip(df)] or something. Or maybe it should be called itcols(). Dunno.

It'd be good to be able to easily iterate over a subset of columns too, of course. Especially as we're currently sans row names.

tshort · 2012-08-08T13:28:44Z

This one is debatable. I don't think there are any functions that depend on this. It's easy enough to fix if I'm forgetting something.

The reason I defaulted to iterating over columns is to keep a DataFrame in line with Julia's Associative methods and also, sort of, to match the equivalent of R's lapply. Which way you normally iterate really depends on the type of analysis you do. For me, I'd say iterating over columns is more common (especially for zoo-like operations). Array comprehension syntax is one I'd rather have work over columns.

Doing anything with a subset of columns is easy: just index it. That operation is cheap.

HarlanH · 2012-08-08T13:37:12Z

Hm. I just don't think of a DataFrame as an Associative structure as strongly as you do. The types of data I work with on a day-to-day basis simply don't make sense to iterate over columnwise, or at least not all of the columns. I tend to more frequently deal with relational data, not sequential data. I also don't like the behavior of lapply on data.frames, at all. Hadley's adply(df, 1, ...) is the more useful (to me) operation, although I don't like having to give axes by numerical index.

I think something like the solution outlined above, where the default iterator over a DF is row-wise (for my use cases), but there's some relatively easy way to change it to col-wise (for your use cases), is the way we should go.

tshort · 2012-08-08T13:40:36Z

I'm fine with that.

StefanKarpinski · 2012-08-08T15:40:49Z

One thing to consider: iteration and indexing should probably be consistent. That is, doing for x in df should probably give you the same thing as successively indexing into df. You can provide other iteration schemes with a wrapper, e.g. an EachRow type that you can use as for r in EachRow(df), or possibly for r in each_row(df) (see EachLine/each_line for comparison).
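A minimal sketch of the wrapper idea, under some loud assumptions: `ToyFrame` and `nrows` are hypothetical stand-ins invented for this example (not the package's types), and the code uses `Base.iterate`, which replaced the start/next/done protocol in Julia 1.0.

```julia
# Hypothetical stand-in for a DataFrame: names plus equal-length columns.
struct ToyFrame
    names::Vector{Symbol}
    columns::Vector{Vector{Any}}
end

# Number of rows, taken from the first column (0 if there are no columns).
nrows(df::ToyFrame) = isempty(df.columns) ? 0 : length(df.columns[1])

# Wrapper type that selects row-wise iteration over the wrapped frame.
struct EachRow
    df::ToyFrame
end

Base.length(it::EachRow) = nrows(it.df)

# The iteration state is the index of the next row to emit.
function Base.iterate(it::EachRow, i::Int = 1)
    i > nrows(it.df) && return nothing
    row = [col[i] for col in it.df.columns]  # gather entry i of every column
    return (row, i + 1)
end

df = ToyFrame([:a, :b], [Any[1, 2, 3], Any["x", "y", "z"]])
rows = collect(EachRow(df))
```

With this in place, for r in EachRow(df) visits one row at a time, while the unwrapped df is left free to iterate (or not) however the package decides. An EachCol wrapper would follow the same shape, yielding `it.df.columns[i]` instead.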

HarlanH · 2012-08-08T16:02:54Z

Hm, perhaps. We do have reference dispatch set up so that df[1:2] gets the first two columns, but df[1:2,1] gets the first two rows of the first column. So a single index into a df does return one or more columns.

The each_line/EachLine technique does seem applicable here.

So, we keep iteration of dfs as columns by default for DataFrames (and by rows necessarily for DataStreams), and define each_row/EachRow for DataFrames? It feels like a bit more typing for me, and not entirely consistent, but reasonable.

StefanKarpinski · 2012-08-08T17:13:12Z

I dunno. Just giving points of reference. This is your call :-)

HarlanH · 2012-08-08T18:09:56Z

A reasonable compromise allowing for future changes would be to define both each_row and each_col, make one of them a no-op, and declare in the documentation that the default (unwrapped) iterator behavior might change in the future, and the safe thing to do would be to use the appropriate function.

johnmyleswhite · 2012-08-08T18:20:53Z

Ok, I find consistency very compelling in general.

But I think there's a very strong argument here for iteration over rows rather than entries or columns: rows are the smallest unit of a DataFrame whose values are of a consistent type. Column A may have a different type from Column B, so you can't blindly loop over the columns and apply consistent processing. But each row consists of exactly the same number of columns, and the type of the entry at location I is the same in every row. So you can blindly treat rows as equivalent.
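The type argument can be made concrete with a small sketch in current Julia. The named tuple of vectors below is a hypothetical stand-in for a two-column DataFrame; nothing here is the package's API.

```julia
# Hypothetical two-column table: one integer column, one string column.
cols = (a = [1, 2, 3], b = ["x", "y", "z"])

# Element types differ from column to column.
coltypes = map(eltype, values(cols))

# Build each row as a named tuple of the i-th entries.
rows = [(a = cols.a[i], b = cols.b[i]) for i in 1:3]

# Every row has the same concrete type, so this holds exactly one type.
rowtypes = unique(map(typeof, rows))
```

The two entries of `coltypes` differ, so a loop over columns cannot assume one element type, whereas `rowtypes` contains a single type shared by all rows.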

Also, in statistical theory the row is almost always the fundamental object. That's the thing you usually assume is IID: the columns are almost never independent from each other and the entries are definitely not IID in any interesting model. (If they were, you'd have a vector, not a matrix or DataFrame.)

That leaves me with the idea that we should do iteration by rows or that there should be no iteration at all using start, next or done. I'm totally happy with that latter approach: create EachRow and EachCol methods or insist that people use a DataStream, not a DataFrame.

StefanKarpinski · 2012-08-08T18:25:03Z

That's a pretty solid argument, John.

johnmyleswhite · 2013-01-02T14:45:58Z

We should make some decisions here.

HarlanH · 2013-01-02T14:51:55Z

I remain OK with the final consensus suggestion: for row in EachRow(df) and for col in EachCol(df) both work, but for x in df throws an error.

DataVector and DataArray iteration should definitely follow standard Julia vector and array behavior...

johnmyleswhite · 2013-01-02T14:52:46Z

Ok. I'll write some tests for that. I believe those already work.

johnmyleswhite · 2013-01-02T15:11:07Z

Turns out that EachRow and EachCol don't exist. I'm a little hesitant to create them, because they'd occupy a space that should arguably be left to Base to fill. I'll make drafts for now, but we should discuss this issue on the main Julia mailing list.

johnmyleswhite · 2013-01-02T15:22:56Z

About to e-mail the main Julia list about iterating over rows and columns. For now, this is closed by a137666.

johnmyleswhite closed this as completed Jan 2, 2013
